Homework 1: The German Credit data.

Description: The German credit dataset contains data on 1000 past credit applicants, described by 30 variables. Each applicant is rated as Good Credit or Bad Credit; therefore, the response variable under study is the credit rating.

Response variable: RESPONSE in the dataset (0 -> Bad credit, 1 -> Good credit)

All the other variables are features of the applicants, which we will study in order to classify new applicants. Our goal is to build the classification model that determines the credit rating as accurately as possible. To do so, we will try several machine learning models and follow the CRISP-DM process.

Report structure:

1/ Data Cleaning: Sorting the dataset to make it ready for the analysis

2/ Exploratory Data Analysis (EDA): Studying the dataset and the variables, one by one, plus a PCA

3/ Analysis: a) Splitting the dataset b) Balancing the data c) Fitting the models d) Accuracy study (scoring) e) Variable selection and importance f) Cross-validation / Bootstrap g) Final Best model

4/ Conclusion: is there an accurate model that can determine the credit rating of new applicants? If yes, which one? Are there limitations to this model?

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Attaching package: 'summarytools'
## The following objects are masked from 'package:Hmisc':
## 
##     label, label<-
## The following object is masked from 'package:tibble':
## 
##     view
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: caret
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
## Welcome to DALEX (version: 2.4.0).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
## 
##     explain
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

In this section, we will proceed to an exploratory data analysis of the German Credit data.

Let’s start by importing the dataset.

German_credit <- read.csv("./../Data_DA/GermanCredit.csv", header = TRUE, sep = ";")

Get to know the data

Title : German credit data

Name of the file : GermanCredit.csv

Abstract

The German Credit data has data on 1000 past credit applicants, described by 30 variables. Each applicant is rated as “Good” or “Bad” credit (encoded as 1 and 0 respectively in the response variable).

Goal : We want to obtain a model that can be used to determine whether new applicants present a good or bad credit risk.

  • Number of instances : 1000
  • Number of attributes : 30 (plus the OBS. identifier and the RESPONSE variable, hence the 32 columns in the file)
  • Attribute Information :
str(German_credit)
## 'data.frame':    1000 obs. of  32 variables:
##  $ OBS.            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CHK_ACCT        : int  0 1 3 0 0 3 3 1 3 1 ...
##  $ DURATION        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ HISTORY         : int  4 2 4 2 3 2 2 2 2 4 ...
##  $ NEW_CAR         : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ USED_CAR        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ FURNITURE       : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ RADIO.TV        : int  1 1 0 0 0 0 0 0 1 0 ...
##  $ EDUCATION       : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ RETRAINING      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AMOUNT          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ SAV_ACCT        : int  4 0 0 0 0 4 2 0 3 0 ...
##  $ EMPLOYMENT      : int  4 2 3 3 2 2 4 2 3 0 ...
##  $ INSTALL_RATE    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ MALE_DIV        : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ MALE_SINGLE     : int  1 0 1 1 1 1 1 1 0 0 ...
##  $ MALE_MAR_or_WID : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ CO.APPLICANT    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GUARANTOR       : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ PRESENT_RESIDENT: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ REAL_ESTATE     : int  1 1 1 0 0 0 0 0 1 0 ...
##  $ PROP_UNKN_NONE  : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ AGE             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ OTHER_INSTALL   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RENT            : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ OWN_RES         : int  1 1 1 0 0 0 1 0 1 1 ...
##  $ NUM_CREDITS     : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ JOB             : int  2 2 1 2 2 1 2 3 1 3 ...
##  $ NUM_DEPENDENTS  : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ TELEPHONE       : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ FOREIGN         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RESPONSE        : int  1 0 1 1 0 1 1 1 1 0 ...
summary(German_credit)
##       OBS.           CHK_ACCT        DURATION       HISTORY     
##  Min.   :   1.0   Min.   :0.000   Min.   : 4.0   Min.   :0.000  
##  1st Qu.: 250.8   1st Qu.:0.000   1st Qu.:12.0   1st Qu.:2.000  
##  Median : 500.5   Median :1.000   Median :18.0   Median :2.000  
##  Mean   : 500.5   Mean   :1.577   Mean   :20.9   Mean   :2.545  
##  3rd Qu.: 750.2   3rd Qu.:3.000   3rd Qu.:24.0   3rd Qu.:4.000  
##  Max.   :1000.0   Max.   :3.000   Max.   :72.0   Max.   :4.000  
##     NEW_CAR         USED_CAR       FURNITURE        RADIO.TV   
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00  
##  Median :0.000   Median :0.000   Median :0.000   Median :0.00  
##  Mean   :0.234   Mean   :0.103   Mean   :0.181   Mean   :0.28  
##  3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:1.00  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.00  
##    EDUCATION        RETRAINING        AMOUNT         SAV_ACCT    
##  Min.   :-1.000   Min.   :0.000   Min.   :  250   Min.   :0.000  
##  1st Qu.: 0.000   1st Qu.:0.000   1st Qu.: 1366   1st Qu.:0.000  
##  Median : 0.000   Median :0.000   Median : 2320   Median :0.000  
##  Mean   : 0.048   Mean   :0.097   Mean   : 3271   Mean   :1.105  
##  3rd Qu.: 0.000   3rd Qu.:0.000   3rd Qu.: 3972   3rd Qu.:2.000  
##  Max.   : 1.000   Max.   :1.000   Max.   :18424   Max.   :4.000  
##    EMPLOYMENT     INSTALL_RATE      MALE_DIV     MALE_SINGLE    MALE_MAR_or_WID
##  Min.   :0.000   Min.   :1.000   Min.   :0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.00   1st Qu.:0.000   1st Qu.:0.000  
##  Median :2.000   Median :3.000   Median :0.00   Median :1.000   Median :0.000  
##  Mean   :2.384   Mean   :2.973   Mean   :0.05   Mean   :0.548   Mean   :0.092  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:0.00   3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :4.000   Max.   :4.000   Max.   :1.00   Max.   :1.000   Max.   :1.000  
##   CO.APPLICANT     GUARANTOR     PRESENT_RESIDENT  REAL_ESTATE   
##  Min.   :0.000   Min.   :0.000   Min.   :1.000    Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:2.000    1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :3.000    Median :0.000  
##  Mean   :0.041   Mean   :0.053   Mean   :2.845    Mean   :0.282  
##  3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:4.000    3rd Qu.:1.000  
##  Max.   :1.000   Max.   :2.000   Max.   :4.000    Max.   :1.000  
##  PROP_UNKN_NONE       AGE        OTHER_INSTALL        RENT      
##  Min.   :0.000   Min.   : 19.0   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.: 27.0   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.000   Median : 33.0   Median :0.000   Median :0.000  
##  Mean   :0.154   Mean   : 35.6   Mean   :0.186   Mean   :0.179  
##  3rd Qu.:0.000   3rd Qu.: 42.0   3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :125.0   Max.   :1.000   Max.   :1.000  
##     OWN_RES       NUM_CREDITS         JOB        NUM_DEPENDENTS 
##  Min.   :0.000   Min.   :1.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :2.000   Median :1.000  
##  Mean   :0.713   Mean   :1.407   Mean   :1.904   Mean   :1.155  
##  3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :4.000   Max.   :3.000   Max.   :2.000  
##    TELEPHONE        FOREIGN         RESPONSE  
##  Min.   :0.000   Min.   :0.000   Min.   :0.0  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0  
##  Median :0.000   Median :0.000   Median :1.0  
##  Mean   :0.404   Mean   :0.037   Mean   :0.7  
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:1.0  
##  Max.   :1.000   Max.   :1.000   Max.   :1.0

We notice that the variable EDUCATION has a minimum value of '-1', but it should be '0' since the variable has only two levels (0 and 1); indeed, observation 37 shows a value of '-1' for EDUCATION. We notice another strange value: in the variable GUARANTOR, the maximum is '2', which has no meaning in our dataset.

After discussion with the banker, he gave us the correct values for these two errors: observation 37 of EDUCATION and observation 234 of GUARANTOR should both be equal to 1.

German_credit$EDUCATION[37] <- 1
German_credit$EDUCATION <- as.factor(German_credit$EDUCATION)

German_credit$GUARANTOR[234] <- 1
German_credit$GUARANTOR <- as.factor(German_credit$GUARANTOR)

We also see that the variable AGE has a maximum of 125. This is suspicious, since it is very unlikely that someone lives to the age of 125. The banker confirmed our doubts: a mistake was made, and at observation 537 the correct age of the client is 75. He asked us to correct this value in our dataset.

German_credit$AGE[537] <- 75
  • There are no missing values.
which(is.na(German_credit))
## integer(0)
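Beyond the global check above, a per-column count makes the absence of missing values explicit. A minimal sketch using only base R (on the same German_credit data frame):

```
# Count NA values in each column; every entry should be 0
colSums(is.na(German_credit))

# A single overall count is an even quicker sanity check
sum(is.na(German_credit))   # expected: 0
```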
  • The response variable is RESPONSE, the last column of the dataset.

Response variable : is the credit rating good?

  1. 0 : No

  2. 1 : Yes
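Given this encoding, a quick frequency table shows how the two classes are balanced. A small sketch in base R (the mean of 0.7 for RESPONSE in the summary above implies a 300/700 split, which is why step 3b of the report deals with balancing the data):

```
# Frequency of each credit rating (0 = Bad, 1 = Good)
table(German_credit$RESPONSE)

# Proportions: the classes are imbalanced, roughly 30% Bad vs 70% Good
prop.table(table(German_credit$RESPONSE))
```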

We have to make sure that the classes of the variables are correct. As shown above, all the variables are read as integers, but we know the dataset should contain both numerical and categorical variables. Therefore, we have to convert some of them.

German_credit$DURATION <- as.numeric(German_credit$DURATION)
German_credit$AMOUNT <- as.numeric(German_credit$AMOUNT)
German_credit$INSTALL_RATE <- as.numeric(German_credit$INSTALL_RATE)
German_credit$AGE <- as.numeric(German_credit$AGE)
German_credit$NUM_CREDITS <- as.numeric(German_credit$NUM_CREDITS)
German_credit$NUM_DEPENDENTS <- as.numeric(German_credit$NUM_DEPENDENTS)

for (i in 1:ncol(German_credit)){
  if (class(German_credit[,i])=="integer"){
    German_credit[,i] <- factor(German_credit[,i])
    }
}
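As a quick sanity check of the loop above, we can tabulate the resulting column classes; with 26 categorical and 6 numerical variables (matching the introduce() output below), we expect counts of 26 factors and 6 numeric columns:

```
# Tally the class of each column after conversion
table(sapply(German_credit, class))
```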

str(German_credit)
## 'data.frame':    1000 obs. of  32 variables:
##  $ OBS.            : Factor w/ 1000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ CHK_ACCT        : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 4 4 2 4 2 ...
##  $ DURATION        : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ HISTORY         : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ NEW_CAR         : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 2 ...
##  $ USED_CAR        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ FURNITURE       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  $ RADIO.TV        : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 2 1 ...
##  $ EDUCATION       : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
##  $ RETRAINING      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AMOUNT          : num  1169 5951 2096 7882 4870 ...
##  $ SAV_ACCT        : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ EMPLOYMENT      : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ INSTALL_RATE    : num  4 2 2 2 3 2 3 2 2 4 ...
##  $ MALE_DIV        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ MALE_SINGLE     : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
##  $ MALE_MAR_or_WID : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ CO.APPLICANT    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ GUARANTOR       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ PRESENT_RESIDENT: Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 4 4 2 4 2 ...
##  $ REAL_ESTATE     : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 2 1 ...
##  $ PROP_UNKN_NONE  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ AGE             : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ OTHER_INSTALL   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RENT            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ OWN_RES         : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
##  $ NUM_CREDITS     : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ JOB             : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 3 2 3 4 2 4 ...
##  $ NUM_DEPENDENTS  : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ TELEPHONE       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  $ FOREIGN         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RESPONSE        : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...

The binary variables are now encoded as factors and the others as numeric.

We can now describe the variables once more; the summary should be more meaningful.

describe(German_credit)
##                   vars    n    mean      sd median trimmed     mad min   max
## OBS.*                1 1000  500.50  288.82  500.5  500.50  370.65   1  1000
## CHK_ACCT*            2 1000    2.58    1.26    2.0    2.60    1.48   1     4
## DURATION             3 1000   20.90   12.06   18.0   19.47    8.90   4    72
## HISTORY*             4 1000    3.54    1.08    3.0    3.59    0.00   1     5
## NEW_CAR*             5 1000    1.23    0.42    1.0    1.17    0.00   1     2
## USED_CAR*            6 1000    1.10    0.30    1.0    1.00    0.00   1     2
## FURNITURE*           7 1000    1.18    0.39    1.0    1.10    0.00   1     2
## RADIO.TV*            8 1000    1.28    0.45    1.0    1.23    0.00   1     2
## EDUCATION*           9 1000    1.05    0.22    1.0    1.00    0.00   1     2
## RETRAINING*         10 1000    1.10    0.30    1.0    1.00    0.00   1     2
## AMOUNT              11 1000 3271.26 2822.74 2319.5 2754.57 1627.15 250 18424
## SAV_ACCT*           12 1000    2.10    1.58    1.0    1.88    0.00   1     5
## EMPLOYMENT*         13 1000    3.38    1.21    3.0    3.43    1.48   1     5
## INSTALL_RATE        14 1000    2.97    1.12    3.0    3.09    1.48   1     4
## MALE_DIV*           15 1000    1.05    0.22    1.0    1.00    0.00   1     2
## MALE_SINGLE*        16 1000    1.55    0.50    2.0    1.56    0.00   1     2
## MALE_MAR_or_WID*    17 1000    1.09    0.29    1.0    1.00    0.00   1     2
## CO.APPLICANT*       18 1000    1.04    0.20    1.0    1.00    0.00   1     2
## GUARANTOR*          19 1000    1.05    0.22    1.0    1.00    0.00   1     2
## PRESENT_RESIDENT*   20 1000    2.85    1.10    3.0    2.93    1.48   1     4
## REAL_ESTATE*        21 1000    1.28    0.45    1.0    1.23    0.00   1     2
## PROP_UNKN_NONE*     22 1000    1.15    0.36    1.0    1.07    0.00   1     2
## AGE                 23 1000   35.55   11.38   33.0   34.17   10.38  19    75
## OTHER_INSTALL*      24 1000    1.19    0.39    1.0    1.11    0.00   1     2
## RENT*               25 1000    1.18    0.38    1.0    1.10    0.00   1     2
## OWN_RES*            26 1000    1.71    0.45    2.0    1.77    0.00   1     2
## NUM_CREDITS         27 1000    1.41    0.58    1.0    1.33    0.00   1     4
## JOB*                28 1000    2.90    0.65    3.0    2.91    0.00   1     4
## NUM_DEPENDENTS      29 1000    1.16    0.36    1.0    1.07    0.00   1     2
## TELEPHONE*          30 1000    1.40    0.49    1.0    1.38    0.00   1     2
## FOREIGN*            31 1000    1.04    0.19    1.0    1.00    0.00   1     2
## RESPONSE*           32 1000    1.70    0.46    2.0    1.75    0.00   1     2
##                   range  skew kurtosis    se
## OBS.*               999  0.00    -1.20  9.13
## CHK_ACCT*             3  0.01    -1.66  0.04
## DURATION             68  1.09     0.90  0.38
## HISTORY*              4 -0.01    -0.59  0.03
## NEW_CAR*              1  1.25    -0.43  0.01
## USED_CAR*             1  2.61     4.81  0.01
## FURNITURE*            1  1.65     0.74  0.01
## RADIO.TV*             1  0.98    -1.04  0.01
## EDUCATION*            1  4.12    15.02  0.01
## RETRAINING*           1  2.72     5.40  0.01
## AMOUNT            18174  1.94     4.25 89.26
## SAV_ACCT*             4  1.01    -0.69  0.05
## EMPLOYMENT*           4 -0.12    -0.94  0.04
## INSTALL_RATE          3 -0.53    -1.21  0.04
## MALE_DIV*             1  4.12    15.02  0.01
## MALE_SINGLE*          1 -0.19    -1.96  0.02
## MALE_MAR_or_WID*      1  2.82     5.95  0.01
## CO.APPLICANT*         1  4.62    19.39  0.01
## GUARANTOR*            1  4.03    14.25  0.01
## PRESENT_RESIDENT*     3 -0.27    -1.38  0.03
## REAL_ESTATE*          1  0.97    -1.07  0.01
## PROP_UNKN_NONE*       1  1.91     1.67  0.01
## AGE                  56  1.02     0.58  0.36
## OTHER_INSTALL*        1  1.61     0.60  0.01
## RENT*                 1  1.67     0.80  0.01
## OWN_RES*              1 -0.94    -1.12  0.01
## NUM_CREDITS           3  1.27     1.58  0.02
## JOB*                  3 -0.37     0.49  0.02
## NUM_DEPENDENTS        1  1.90     1.63  0.01
## TELEPHONE*            1  0.39    -1.85  0.02
## FOREIGN*              1  4.90    22.02  0.01
## RESPONSE*             1 -0.87    -1.24  0.01
introduce(German_credit)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 1000      32               26                  6                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0          1000              32000       237424
plot_intro(German_credit)

The plot shows the share of continuous and discrete variables and whether any observations are missing.

Visualization of the data

First, we plot histograms of all the continuous variables, together with the corresponding density plots.

plot_histogram(German_credit)

plot_density(German_credit)

Our first observation is that the variables are not really normally distributed; some of them are right-skewed.

We can look at the tails and outliers more carefully through boxplots.

plot_boxplot(German_credit, by = 'RESPONSE',  ncol = 2,
             title = "Side-by-side boxplots")

This is not alarming: it makes sense that only a few people have a large credit amount. However, the 'bad' clients seem to ask for larger credit amounts than the 'good' clients.
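This impression can be checked numerically; a minimal sketch comparing the credit amount per response group, assuming the cleaned German_credit data frame from above:

```
# Median and mean AMOUNT for Bad (0) and Good (1) applicants;
# higher values for group 0 would support the visual impression
aggregate(AMOUNT ~ RESPONSE, data = German_credit, FUN = median)
aggregate(AMOUNT ~ RESPONSE, data = German_credit, FUN = mean)
```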

Now, we can make some barplots of the categorical variables.

for (i in 1:ncol(German_credit)){
  if (is.factor(German_credit[, i])){
    print(ggplot(German_credit) +
            geom_bar(aes(x = .data[[colnames(German_credit)[i]]]), fill = "coral") +
            ggtitle(paste("Barplot of", colnames(German_credit)[i])) +
            labs(x = colnames(German_credit)[i]))
  }
}

From these barplots we can see how the observations are distributed across the levels of each categorical variable.

Finally, we can produce a general summary of the data frame.

dfSummary(German_credit, style = 'grid')
## Data Frame Summary  
## Dimensions: 1000 x 32  
## Duplicates: 0  
## 
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | No | Variable         | Stats / Values              | Freqs (% of Valid)  | Graph                 | Valid    | Missing |
## +====+==================+=============================+=====================+=======================+==========+=========+
## | 1  | OBS.             | 1. 1                        |   1 ( 0.1%)         |                       | 1000     | 0       |
## |    | [factor]         | 2. 2                        |   1 ( 0.1%)         |                       | (100.0%) | (0.0%)  |
## |    |                  | 3. 3                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 4. 4                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 5. 5                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 6. 6                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 7. 7                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 8. 8                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 9. 9                        |   1 ( 0.1%)         |                       |          |         |
## |    |                  | 10. 10                      |   1 ( 0.1%)         |                       |          |         |
## |    |                  | [ 990 others ]              | 990 (99.0%)         | IIIIIIIIIIIIIIIIIII   |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 2  | CHK_ACCT         | 1. 0                        | 274 (27.4%)         | IIIII                 | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 269 (26.9%)         | IIIII                 | (100.0%) | (0.0%)  |
## |    |                  | 3. 2                        |  63 ( 6.3%)         | I                     |          |         |
## |    |                  | 4. 3                        | 394 (39.4%)         | IIIIIII               |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 3  | DURATION         | Mean (sd) : 20.9 (12.1)     | 33 distinct values  |     :                 | 1000     | 0       |
## |    | [numeric]        | min < med < max:            |                     |   : :                 | (100.0%) | (0.0%)  |
## |    |                  | 4 < 18 < 72                 |                     | . : :                 |          |         |
## |    |                  | IQR (CV) : 12 (0.6)         |                     | : : :   .             |          |         |
## |    |                  |                             |                     | : : : : : . :         |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 4  | HISTORY          | 1. 0                        |  40 ( 4.0%)         |                       | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  49 ( 4.9%)         |                       | (100.0%) | (0.0%)  |
## |    |                  | 3. 2                        | 530 (53.0%)         | IIIIIIIIII            |          |         |
## |    |                  | 4. 3                        |  88 ( 8.8%)         | I                     |          |         |
## |    |                  | 5. 4                        | 293 (29.3%)         | IIIII                 |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 5  | NEW_CAR          | 1. 0                        | 766 (76.6%)         | IIIIIIIIIIIIIII       | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 234 (23.4%)         | IIII                  | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 6  | USED_CAR         | 1. 0                        | 897 (89.7%)         | IIIIIIIIIIIIIIIII     | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 103 (10.3%)         | II                    | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 7  | FURNITURE        | 1. 0                        | 819 (81.9%)         | IIIIIIIIIIIIIIII      | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 181 (18.1%)         | III                   | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 8  | RADIO.TV         | 1. 0                        | 720 (72.0%)         | IIIIIIIIIIIIII        | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 280 (28.0%)         | IIIII                 | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 9  | EDUCATION        | 1. 0                        | 950 (95.0%)         | IIIIIIIIIIIIIIIIIII   | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  50 ( 5.0%)         | I                     | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 10 | RETRAINING       | 1. 0                        | 903 (90.3%)         | IIIIIIIIIIIIIIIIII    | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  97 ( 9.7%)         | I                     | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 11 | AMOUNT           | Mean (sd) : 3271.3 (2822.7) | 921 distinct values | :                     | 1000     | 0       |
## |    | [numeric]        | min < med < max:            |                     | : .                   | (100.0%) | (0.0%)  |
## |    |                  | 250 < 2319.5 < 18424        |                     | : :                   |          |         |
## |    |                  | IQR (CV) : 2606.8 (0.9)     |                     | : :                   |          |         |
## |    |                  |                             |                     | : : : : .             |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 12 | SAV_ACCT         | 1. 0                        | 603 (60.3%)         | IIIIIIIIIIII          | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 103 (10.3%)         | II                    | (100.0%) | (0.0%)  |
## |    |                  | 3. 2                        |  63 ( 6.3%)         | I                     |          |         |
## |    |                  | 4. 3                        |  48 ( 4.8%)         |                       |          |         |
## |    |                  | 5. 4                        | 183 (18.3%)         | III                   |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 13 | EMPLOYMENT       | 1. 0                        |  62 ( 6.2%)         | I                     | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 172 (17.2%)         | III                   | (100.0%) | (0.0%)  |
## |    |                  | 3. 2                        | 339 (33.9%)         | IIIIII                |          |         |
## |    |                  | 4. 3                        | 174 (17.4%)         | III                   |          |         |
## |    |                  | 5. 4                        | 253 (25.3%)         | IIIII                 |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 14 | INSTALL_RATE     | Mean (sd) : 3 (1.1)         | 1 : 136 (13.6%)     | II                    | 1000     | 0       |
## |    | [numeric]        | min < med < max:            | 2 : 231 (23.1%)     | IIII                  | (100.0%) | (0.0%)  |
## |    |                  | 1 < 3 < 4                   | 3 : 157 (15.7%)     | III                   |          |         |
## |    |                  | IQR (CV) : 2 (0.4)          | 4 : 476 (47.6%)     | IIIIIIIII             |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 15 | MALE_DIV         | 1. 0                        | 950 (95.0%)         | IIIIIIIIIIIIIIIIIII   | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  50 ( 5.0%)         | I                     | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 16 | MALE_SINGLE      | 1. 0                        | 452 (45.2%)         | IIIIIIIII             | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 548 (54.8%)         | IIIIIIIIII            | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 17 | MALE_MAR_or_WID  | 1. 0                        | 908 (90.8%)         | IIIIIIIIIIIIIIIIII    | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  92 ( 9.2%)         | I                     | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 18 | CO.APPLICANT     | 1. 0                        | 959 (95.9%)         | IIIIIIIIIIIIIIIIIII   | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  41 ( 4.1%)         |                       | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 19 | GUARANTOR        | 1. 0                        | 948 (94.8%)         | IIIIIIIIIIIIIIIIII    | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  52 ( 5.2%)         | I                     | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 20 | PRESENT_RESIDENT | 1. 1                        | 130 (13.0%)         | II                    | 1000     | 0       |
## |    | [factor]         | 2. 2                        | 308 (30.8%)         | IIIIII                | (100.0%) | (0.0%)  |
## |    |                  | 3. 3                        | 149 (14.9%)         | II                    |          |         |
## |    |                  | 4. 4                        | 413 (41.3%)         | IIIIIIII              |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 21 | REAL_ESTATE      | 1. 0                        | 718 (71.8%)         | IIIIIIIIIIIIII        | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 282 (28.2%)         | IIIII                 | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 22 | PROP_UNKN_NONE   | 1. 0                        | 846 (84.6%)         | IIIIIIIIIIIIIIII      | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 154 (15.4%)         | III                   | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 23 | AGE              | Mean (sd) : 35.5 (11.4)     | 53 distinct values  |   :                   | 1000     | 0       |
## |    | [numeric]        | min < med < max:            |                     |   : .                 | (100.0%) | (0.0%)  |
## |    |                  | 19 < 33 < 75                |                     | : : : :               |          |         |
## |    |                  | IQR (CV) : 15 (0.3)         |                     | : : : : :             |          |         |
## |    |                  |                             |                     | : : : : : : : . .     |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 24 | OTHER_INSTALL    | 1. 0                        | 814 (81.4%)         | IIIIIIIIIIIIIIII      | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 186 (18.6%)         | III                   | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 25 | RENT             | 1. 0                        | 821 (82.1%)         | IIIIIIIIIIIIIIII      | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 179 (17.9%)         | III                   | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 26 | OWN_RES          | 1. 0                        | 287 (28.7%)         | IIIII                 | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 713 (71.3%)         | IIIIIIIIIIIIII        | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 27 | NUM_CREDITS      | Mean (sd) : 1.4 (0.6)       | 1 : 633 (63.3%)     | IIIIIIIIIIII          | 1000     | 0       |
## |    | [numeric]        | min < med < max:            | 2 : 333 (33.3%)     | IIIIII                | (100.0%) | (0.0%)  |
## |    |                  | 1 < 1 < 4                   | 3 :  28 ( 2.8%)     |                       |          |         |
## |    |                  | IQR (CV) : 1 (0.4)          | 4 :   6 ( 0.6%)     |                       |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 28 | JOB              | 1. 0                        |  22 ( 2.2%)         |                       | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 200 (20.0%)         | IIII                  | (100.0%) | (0.0%)  |
## |    |                  | 3. 2                        | 630 (63.0%)         | IIIIIIIIIIII          |          |         |
## |    |                  | 4. 3                        | 148 (14.8%)         | II                    |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 29 | NUM_DEPENDENTS   | Min  : 1                    | 1 : 845 (84.5%)     | IIIIIIIIIIIIIIII      | 1000     | 0       |
## |    | [numeric]        | Mean : 1.2                  | 2 : 155 (15.5%)     | III                   | (100.0%) | (0.0%)  |
## |    |                  | Max  : 2                    |                     |                       |          |         |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 30 | TELEPHONE        | 1. 0                        | 596 (59.6%)         | IIIIIIIIIII           | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 404 (40.4%)         | IIIIIIII              | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 31 | FOREIGN          | 1. 0                        | 963 (96.3%)         | IIIIIIIIIIIIIIIIIII   | 1000     | 0       |
## |    | [factor]         | 2. 1                        |  37 ( 3.7%)         |                       | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+
## | 32 | RESPONSE         | 1. 0                        | 300 (30.0%)         | IIIIII                | 1000     | 0       |
## |    | [factor]         | 2. 1                        | 700 (70.0%)         | IIIIIIIIIIIIII        | (100.0%) | (0.0%)  |
## +----+------------------+-----------------------------+---------------------+-----------------------+----------+---------+

Correlation plot:

Correlation plot between continuous variables:

plot_correlation(German_credit, type= 'c', cor_args = list( 'use' = 'complete.obs'))

There is little correlation between the continuous variables. We can still notice a correlation of 62% between DURATION and AMOUNT. This makes sense and is well known to bankers: the larger the amount of the credit, the longer its duration in months.
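This correlation can also be checked numerically; a minimal sketch, assuming the `German_credit` data frame loaded earlier:

```r
# Pearson correlation between credit duration and credit amount,
# ignoring incomplete observations
cor(German_credit$DURATION, German_credit$AMOUNT, use = "complete.obs")
# expected to be around 0.62, matching the correlation plot
```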

Correlation plot between categorical variables:

plot_correlation(German_credit, type= 'd')
## 1 features with more than 20 categories ignored!
## OBS.: 1000 categories

It is difficult to read the correlations since there are many variables on the graph. We can still see higher correlations between RESPONSE 1:

  • and people without a checking account (CHK_ACCT_3)
  • and people with a critical account history (HISTORY_4)
  • and the variable REAL_ESTATE (REAL_ESTATE)
  • and applicants whose property is known (PROP_UNKN_NONE_0)
  • and applicants who own their residence (OWN_RES_1)

We can also see some correlations between RESPONSE 0:

  • and people with a checking-account balance < 0 DM (CHK_ACCT_0)
  • and people with an average savings-account balance < 100 DM (SAV_ACCT_0)
  • and the variable REAL_ESTATE (REAL_ESTATE)

PCA Exploration:

A PCA is a useful exploration step: it can reduce the dimensionality of the data and/or help identify which variables to select.

We start by selecting the numerical values:

German_credit.num <- German_credit %>% 
  select('DURATION', 'AMOUNT', 'INSTALL_RATE', 'AGE', 'NUM_CREDITS','NUM_DEPENDENTS')
German_credit.pca <- prcomp(German_credit.num, center = TRUE, scale = TRUE)
summary(German_credit.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6
## Standard deviation     1.2873 1.1208 1.0443 0.9318 0.9193 0.53164
## Proportion of Variance 0.2762 0.2094 0.1818 0.1447 0.1409 0.04711
## Cumulative Proportion  0.2762 0.4856 0.6673 0.8120 0.9529 1.00000

From the PCA summary, we can see that four principal components should be retained in order to explain approximately 81% of the variance of the data.

Eigenvalue analysis:

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.6570953        27.618256                    27.61826
## Dim.2  1.2562810        20.938016                    48.55627
## Dim.3  1.0906419        18.177365                    66.73364
## Dim.4  0.8682109        14.470181                    81.20382
## Dim.5  0.8451277        14.085462                    95.28928
## Dim.6  0.2826431         4.710719                   100.00000

According to the eigenvalues, three dimensions should be retained: Dim.1, Dim.2 and Dim.3 each have an eigenvalue greater than 1 (Kaiser criterion).
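The eigenvalues are simply the squared standard deviations of the principal components, so the Kaiser rule can be applied directly to the `prcomp` object; a sketch, assuming `German_credit.pca` from above:

```r
# eigenvalues are the variances of the principal components
eig <- German_credit.pca$sdev^2

# Kaiser criterion: retain components whose eigenvalue exceeds 1
which(eig > 1)  # here: components 1, 2 and 3
```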

fviz_pca_ind(German_credit.pca,
             col.ind = "cos2", # color by cos2
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     
             )
## Warning: ggrepel: 926 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_pca_var(German_credit.pca,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     
             )

From this circle of correlations, we see that:

  • The first principal component PC1 is strongly positively correlated with AMOUNT and DURATION: the larger PC1 is, the larger these features are. It is also slightly negatively correlated with INSTALL_RATE.
  • The second principal component PC2 is strongly positively correlated with AGE, NUM_DEPENDENTS and NUM_CREDITS.

From this biplot, we can see some characteristics of the observations.

Here, we can see the distribution of the response variable (0-1) in the reduced dimensions. The Bad Credits (0) appear slightly more positively correlated with dimension 1, and therefore with AMOUNT and DURATION. The Good Credits (1), by contrast, look more positively correlated with dimension 2, i.e. AGE, NUM_CREDITS and NUM_DEPENDENTS.
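A plot like the one described can be produced by grouping the individuals on the response variable; a sketch, assuming the objects defined above (parameter choices are illustrative):

```r
# color the individuals by credit rating (0 = Bad, 1 = Good)
fviz_pca_ind(German_credit.pca,
             geom = "point",                      # points only, no labels
             habillage = German_credit$RESPONSE,  # group by response
             addEllipses = TRUE)                  # concentration ellipses
```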

As we have made some modifications, we export the cleaned dataset so the other files can reuse it directly.

# write.csv(German_credit,"./../Data_DA/GermanCredit_cleaned.csv", row.names = FALSE)

Import libraries and data

German_credit_cleaned <- read.csv("./../Data_DA/GermanCredit_cleaned.csv", header = TRUE)

German_credit_cleaned$DURATION <- as.numeric(German_credit_cleaned$DURATION)
German_credit_cleaned$AMOUNT <- as.numeric(German_credit_cleaned$AMOUNT)
German_credit_cleaned$INSTALL_RATE <- as.numeric(German_credit_cleaned$INSTALL_RATE)
German_credit_cleaned$AGE <- as.numeric(German_credit_cleaned$AGE)
German_credit_cleaned$NUM_CREDITS <- as.numeric(German_credit_cleaned$NUM_CREDITS)
German_credit_cleaned$NUM_DEPENDENTS <- as.numeric(German_credit_cleaned$NUM_DEPENDENTS)

for (i in 1:ncol(German_credit_cleaned)){
  if (class(German_credit_cleaned[,i])=="integer"){
    German_credit_cleaned[,i] <- factor(German_credit_cleaned[,i])
    }
}

str(German_credit_cleaned)
## 'data.frame':    1000 obs. of  32 variables:
##  $ OBS.            : Factor w/ 1000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ CHK_ACCT        : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 4 4 2 4 2 ...
##  $ DURATION        : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ HISTORY         : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ NEW_CAR         : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 2 ...
##  $ USED_CAR        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ FURNITURE       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  $ RADIO.TV        : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 2 1 ...
##  $ EDUCATION       : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
##  $ RETRAINING      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AMOUNT          : num  1169 5951 2096 7882 4870 ...
##  $ SAV_ACCT        : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ EMPLOYMENT      : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ INSTALL_RATE    : num  4 2 2 2 3 2 3 2 2 4 ...
##  $ MALE_DIV        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ MALE_SINGLE     : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
##  $ MALE_MAR_or_WID : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ CO.APPLICANT    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ GUARANTOR       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ PRESENT_RESIDENT: Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 4 4 2 4 2 ...
##  $ REAL_ESTATE     : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 2 1 ...
##  $ PROP_UNKN_NONE  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ AGE             : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ OTHER_INSTALL   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RENT            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ OWN_RES         : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
##  $ NUM_CREDITS     : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ JOB             : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 3 2 3 4 2 4 ...
##  $ NUM_DEPENDENTS  : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ TELEPHONE       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  $ FOREIGN         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RESPONSE        : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...

Training set and Test set

In order to fit the models, we split our dataset into two sets: a training set (75%) and a test set (25%). We also drop the first variable, OBS., as it is only the index number of each observation.

set.seed(1234)
# indexes of the training set observations 
index.tr <- sample(x = 1:nrow(German_credit_cleaned), 
                   size= 3/4*nrow(German_credit_cleaned), replace=FALSE)
# training set 
German_credit.tr <- German_credit_cleaned[index.tr,-1]
# test set 
German_credit.te <- German_credit_cleaned[-index.tr,-1]

Balancing the dataset

In this part, we balance the training data in order to improve the predictions of “Good Credit” and “Bad Credit”, since we have many more observations of “Good Credit” than of “Bad Credit”.

The table below reveals the imbalance.

## Statistics on the training set
table(German_credit.tr$RESPONSE)
## 
##   0   1 
## 223 527

Indeed, we observe that the “Good Credit” (1) response appears 527 times in the training set and “Bad Credit” (0) only 223 times, less than half as often. Since there are many more “Good Credit” cases, any model tends to favor predicting “Good Credit”. This yields a good overall accuracy, but a low specificity and a low balanced accuracy.
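The effect can be illustrated with a trivial majority-class baseline: always predicting “Good Credit” already matches the class proportion in accuracy, while the minority class is never detected. A sketch, assuming the training set above:

```r
# naive classifier: predict "1" (Good Credit) for everyone
counts <- table(German_credit.tr$RESPONSE)  # 0: 223, 1: 527

# overall accuracy of always predicting "Good"
counts["1"] / sum(counts)  # about 0.70

# but no "Bad Credit" case is ever detected: its recall is 0,
# so the balanced accuracy drops to (0 + 1) / 2 = 0.5
```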

Sub-sampling

Balancing by sub-sampling consists of keeping all the cases of the smallest class (here “Bad Credit”) and drawing at random the same number of cases from the largest class (here “Good Credit”).

n.Bad_Credit <- min(table(German_credit.tr$RESPONSE)) ## 223

## the "0" cases
German_credit.tr.Bad_Credit <- filter(German_credit.tr, RESPONSE == 0) 

## The "1" cases
German_credit.tr.Good_Credit <- filter(German_credit.tr, RESPONSE == 1)

## sub-sample 223 instances from the "Good Credit" by drawing indexes
index.no <- sample(size=n.Bad_Credit, x=1:nrow(German_credit.tr.Good_Credit), replace=FALSE) 

## Bind all the "Bad" and the sub-sampled "Good"
German_Credit.tr.subs <- data.frame(rbind(German_credit.tr.Bad_Credit,
                                          German_credit.tr.Good_Credit[index.no,])) 

## The cases are balanced
table(German_Credit.tr.subs$RESPONSE) 
## 
##   0   1 
## 223 223

The dataset is now balanced.
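The same sub-sampling can be obtained in a single call with `caret::downSample()`, which keeps all minority cases and randomly drops majority cases; an alternative sketch under the assumption that caret is loaded:

```r
set.seed(1234)
# downSample() returns the balanced predictors plus the response column
balanced <- caret::downSample(
  x = German_credit.tr[, setdiff(names(German_credit.tr), "RESPONSE")],
  y = German_credit.tr$RESPONSE,
  yname = "RESPONSE")

table(balanced$RESPONSE)  # 223 cases of each class
```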

Fitting a model : Classification Tree (Decision Tree)

Let’s try a classification tree

german.ct <- rpart(RESPONSE ~ ., method = "class", data = German_Credit.tr.subs )
summary(german.ct)
## Call:
## rpart(formula = RESPONSE ~ ., data = German_Credit.tr.subs, method = "class")
##   n= 446 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.39910314      0 1.0000000 1.0852018 0.04717919
## 2 0.02242152      1 0.6008969 0.6367713 0.04411727
## 3 0.01457399      2 0.5784753 0.6681614 0.04466826
## 4 0.01195815      7 0.4843049 0.6905830 0.04502767
## 5 0.01121076     10 0.4484305 0.6726457 0.04474239
## 6 0.01000000     12 0.4260090 0.6726457 0.04474239
## 
## Variable importance
##        CHK_ACCT          AMOUNT         HISTORY        DURATION       GUARANTOR 
##              34              15              14              10               6 
##      EMPLOYMENT        SAV_ACCT             AGE     MALE_SINGLE     NUM_CREDITS 
##               6               4               2               2               2 
##    INSTALL_RATE     REAL_ESTATE MALE_MAR_or_WID 
##               1               1               1 
## 
## Node number 1: 446 observations,    complexity param=0.3991031
##   predicted class=0  expected loss=0.5  P(node) =1
##     class counts:   223   223
##    probabilities: 0.500 0.500 
##   left son=2 (269 obs) right son=3 (177 obs)
##   Primary splits:
##       CHK_ACCT splits as  LLRR,       improve=37.098750, (0 missing)
##       HISTORY  splits as  LLLLR,      improve=12.867960, (0 missing)
##       AMOUNT   < 8962.5 to the right, improve=12.636380, (0 missing)
##       DURATION < 34.5   to the right, improve=11.073260, (0 missing)
##       SAV_ACCT splits as  LLRRR,      improve= 9.538067, (0 missing)
##   Surrogate splits:
##       SAV_ACCT    splits as  LLRRL,      agree=0.648, adj=0.113, (0 split)
##       DURATION    < 10.5   to the right, agree=0.621, adj=0.045, (0 split)
##       HISTORY     splits as  LLLLR,      agree=0.617, adj=0.034, (0 split)
##       AMOUNT      < 440.5  to the right, agree=0.610, adj=0.017, (0 split)
##       REAL_ESTATE splits as  LR,         agree=0.608, adj=0.011, (0 split)
## 
## Node number 2: 269 observations,    complexity param=0.01457399
##   predicted class=0  expected loss=0.3345725  P(node) =0.603139
##     class counts:   179    90
##    probabilities: 0.665 0.335 
##   left son=4 (22 obs) right son=5 (247 obs)
##   Primary splits:
##       AMOUNT         < 9340.5 to the right, improve=5.363996, (0 missing)
##       DURATION       < 27.5   to the right, improve=5.072190, (0 missing)
##       HISTORY        splits as  LLLRR,      improve=5.002720, (0 missing)
##       GUARANTOR      splits as  LR,         improve=3.606846, (0 missing)
##       PROP_UNKN_NONE splits as  RL,         improve=3.532773, (0 missing)
## 
## Node number 3: 177 observations,    complexity param=0.02242152
##   predicted class=1  expected loss=0.2485876  P(node) =0.396861
##     class counts:    44   133
##    probabilities: 0.249 0.751 
##   left son=6 (17 obs) right son=7 (160 obs)
##   Primary splits:
##       HISTORY       splits as  LRRLR,      improve=5.972088, (0 missing)
##       AMOUNT        < 7839.5 to the right, improve=5.310802, (0 missing)
##       EMPLOYMENT    splits as  LLLRR,      improve=4.645266, (0 missing)
##       RETRAINING    splits as  RL,         improve=3.509915, (0 missing)
##       OTHER_INSTALL splits as  RL,         improve=3.216630, (0 missing)
## 
## Node number 4: 22 observations
##   predicted class=0  expected loss=0  P(node) =0.04932735
##     class counts:    22     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 247 observations,    complexity param=0.01457399
##   predicted class=0  expected loss=0.3643725  P(node) =0.5538117
##     class counts:   157    90
##    probabilities: 0.636 0.364 
##   left son=10 (38 obs) right son=11 (209 obs)
##   Primary splits:
##       HISTORY   splits as  LLRRR,      improve=4.867501, (0 missing)
##       SAV_ACCT  splits as  LLLRR,      improve=3.593650, (0 missing)
##       DURATION  < 27.5   to the right, improve=3.361335, (0 missing)
##       GUARANTOR splits as  LR,         improve=2.939271, (0 missing)
##       USED_CAR  splits as  LR,         improve=2.918703, (0 missing)
## 
## Node number 6: 17 observations
##   predicted class=0  expected loss=0.3529412  P(node) =0.03811659
##     class counts:    11     6
##    probabilities: 0.647 0.353 
## 
## Node number 7: 160 observations,    complexity param=0.01121076
##   predicted class=1  expected loss=0.20625  P(node) =0.3587444
##     class counts:    33   127
##    probabilities: 0.206 0.794 
##   left son=14 (82 obs) right son=15 (78 obs)
##   Primary splits:
##       EMPLOYMENT    splits as  LLLRR,      improve=3.272428, (0 missing)
##       OTHER_INSTALL splits as  RL,         improve=3.098970, (0 missing)
##       AMOUNT        < 7447   to the right, improve=2.953289, (0 missing)
##       CHK_ACCT      splits as  --LR,       improve=1.749795, (0 missing)
##       DURATION      < 34.5   to the right, improve=1.608088, (0 missing)
##   Surrogate splits:
##       MALE_SINGLE    splits as  LR,         agree=0.650, adj=0.282, (0 split)
##       REAL_ESTATE    splits as  RL,         agree=0.613, adj=0.205, (0 split)
##       AGE            < 33.5   to the left,  agree=0.600, adj=0.179, (0 split)
##       NUM_DEPENDENTS < 1.5    to the left,  agree=0.581, adj=0.141, (0 split)
##       INSTALL_RATE   < 3.5    to the left,  agree=0.575, adj=0.128, (0 split)
## 
## Node number 10: 38 observations
##   predicted class=0  expected loss=0.1315789  P(node) =0.08520179
##     class counts:    33     5
##    probabilities: 0.868 0.132 
## 
## Node number 11: 209 observations,    complexity param=0.01457399
##   predicted class=0  expected loss=0.4066986  P(node) =0.4686099
##     class counts:   124    85
##    probabilities: 0.593 0.407 
##   left son=22 (60 obs) right son=23 (149 obs)
##   Primary splits:
##       DURATION   < 27.5   to the right, improve=6.078470, (0 missing)
##       SAV_ACCT   splits as  LLLRR,      improve=3.141704, (0 missing)
##       USED_CAR   splits as  LR,         improve=3.000838, (0 missing)
##       EMPLOYMENT splits as  LLRRR,      improve=2.828791, (0 missing)
##       AGE        < 25.5   to the left,  improve=2.823140, (0 missing)
##   Surrogate splits:
##       AMOUNT         < 4201   to the right, agree=0.804, adj=0.317, (0 split)
##       PROP_UNKN_NONE splits as  RL,         agree=0.732, adj=0.067, (0 split)
##       EDUCATION      splits as  RL,         agree=0.718, adj=0.017, (0 split)
## 
## Node number 14: 82 observations,    complexity param=0.01121076
##   predicted class=1  expected loss=0.304878  P(node) =0.1838565
##     class counts:    25    57
##    probabilities: 0.305 0.695 
##   left son=28 (7 obs) right son=29 (75 obs)
##   Primary splits:
##       AMOUNT        < 6827   to the right, improve=4.668479, (0 missing)
##       DURATION      < 25.5   to the right, improve=4.325542, (0 missing)
##       JOB           splits as  LRRL,       improve=3.679907, (0 missing)
##       OTHER_INSTALL splits as  RL,         improve=2.398955, (0 missing)
##       AGE           < 26.5   to the left,  improve=1.921904, (0 missing)
##   Surrogate splits:
##       GUARANTOR splits as  RL, agree=0.927, adj=0.143, (0 split)
## 
## Node number 15: 78 observations
##   predicted class=1  expected loss=0.1025641  P(node) =0.1748879
##     class counts:     8    70
##    probabilities: 0.103 0.897 
## 
## Node number 22: 60 observations
##   predicted class=0  expected loss=0.2166667  P(node) =0.1345291
##     class counts:    47    13
##    probabilities: 0.783 0.217 
## 
## Node number 23: 149 observations,    complexity param=0.01457399
##   predicted class=0  expected loss=0.4832215  P(node) =0.3340807
##     class counts:    77    72
##    probabilities: 0.517 0.483 
##   left son=46 (94 obs) right son=47 (55 obs)
##   Primary splits:
##       HISTORY     splits as  --LRR, improve=3.175875, (0 missing)
##       EMPLOYMENT  splits as  LLRRR, improve=2.996737, (0 missing)
##       MALE_SINGLE splits as  LR,    improve=2.747980, (0 missing)
##       GUARANTOR   splits as  LR,    improve=2.665119, (0 missing)
##       USED_CAR    splits as  LR,    improve=2.595185, (0 missing)
##   Surrogate splits:
##       NUM_CREDITS < 1.5    to the left,  agree=0.846, adj=0.582, (0 split)
##       AGE         < 34.5   to the left,  agree=0.698, adj=0.182, (0 split)
##       AMOUNT      < 4212.5 to the left,  agree=0.664, adj=0.091, (0 split)
##       SAV_ACCT    splits as  LLLRL,      agree=0.658, adj=0.073, (0 split)
##       DURATION    < 6.5    to the right, agree=0.651, adj=0.055, (0 split)
## 
## Node number 28: 7 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.01569507
##     class counts:     6     1
##    probabilities: 0.857 0.143 
## 
## Node number 29: 75 observations
##   predicted class=1  expected loss=0.2533333  P(node) =0.1681614
##     class counts:    19    56
##    probabilities: 0.253 0.747 
## 
## Node number 46: 94 observations,    complexity param=0.01457399
##   predicted class=0  expected loss=0.4042553  P(node) =0.2107623
##     class counts:    56    38
##    probabilities: 0.596 0.404 
##   left son=92 (86 obs) right son=93 (8 obs)
##   Primary splits:
##       GUARANTOR   splits as  LR,         improve=6.206828, (0 missing)
##       NEW_CAR     splits as  RL,         improve=3.635910, (0 missing)
##       RADIO.TV    splits as  LR,         improve=2.878234, (0 missing)
##       MALE_SINGLE splits as  LR,         improve=2.142455, (0 missing)
##       AMOUNT      < 1491   to the left,  improve=1.798335, (0 missing)
## 
## Node number 47: 55 observations,    complexity param=0.01195815
##   predicted class=1  expected loss=0.3818182  P(node) =0.1233184
##     class counts:    21    34
##    probabilities: 0.382 0.618 
##   left son=94 (46 obs) right son=95 (9 obs)
##   Primary splits:
##       EMPLOYMENT    splits as  LLLRL,      improve=3.137549, (0 missing)
##       TELEPHONE     splits as  LR,         improve=1.982155, (0 missing)
##       SAV_ACCT      splits as  LRLLR,      improve=1.941414, (0 missing)
##       OTHER_INSTALL splits as  RL,         improve=1.857409, (0 missing)
##       AMOUNT        < 2627   to the left,  improve=1.056516, (0 missing)
##   Surrogate splits:
##       AMOUNT < 678.5  to the right, agree=0.855, adj=0.111, (0 split)
## 
## Node number 92: 86 observations
##   predicted class=0  expected loss=0.3488372  P(node) =0.1928251
##     class counts:    56    30
##    probabilities: 0.651 0.349 
## 
## Node number 93: 8 observations
##   predicted class=1  expected loss=0  P(node) =0.01793722
##     class counts:     0     8
##    probabilities: 0.000 1.000 
## 
## Node number 94: 46 observations,    complexity param=0.01195815
##   predicted class=1  expected loss=0.4565217  P(node) =0.103139
##     class counts:    21    25
##    probabilities: 0.457 0.543 
##   left son=188 (30 obs) right son=189 (16 obs)
##   Primary splits:
##       AMOUNT        < 2627   to the left,  improve=2.0927540, (0 missing)
##       SAV_ACCT      splits as  LRLLR,      improve=1.6246220, (0 missing)
##       TELEPHONE     splits as  LR,         improve=1.6139660, (0 missing)
##       OTHER_INSTALL splits as  RL,         improve=1.4339300, (0 missing)
##       EMPLOYMENT    splits as  LLR-R,      improve=0.9882491, (0 missing)
##   Surrogate splits:
##       INSTALL_RATE < 2.5    to the right, agree=0.826, adj=0.500, (0 split)
##       TELEPHONE    splits as  LR,         agree=0.739, adj=0.250, (0 split)
##       FURNITURE    splits as  LR,         agree=0.696, adj=0.125, (0 split)
##       USED_CAR     splits as  LR,         agree=0.674, adj=0.062, (0 split)
##       EDUCATION    splits as  LR,         agree=0.674, adj=0.062, (0 split)
## 
## Node number 95: 9 observations
##   predicted class=1  expected loss=0  P(node) =0.02017937
##     class counts:     0     9
##    probabilities: 0.000 1.000 
## 
## Node number 188: 30 observations,    complexity param=0.01195815
##   predicted class=0  expected loss=0.4333333  P(node) =0.06726457
##     class counts:    17    13
##    probabilities: 0.567 0.433 
##   left son=376 (14 obs) right son=377 (16 obs)
##   Primary splits:
##       DURATION < 17     to the right, improve=2.5190480, (0 missing)
##       AMOUNT   < 1911   to the right, improve=1.5407870, (0 missing)
##       SAV_ACCT splits as  LLRRR,      improve=1.4000000, (0 missing)
##       AGE      < 37     to the left,  improve=0.8960128, (0 missing)
##       JOB      splits as  RRLR,       improve=0.8333333, (0 missing)
##   Surrogate splits:
##       MALE_SINGLE      splits as  LR,         agree=0.733, adj=0.429, (0 split)
##       AGE              < 34     to the left,  agree=0.700, adj=0.357, (0 split)
##       MALE_MAR_or_WID  splits as  RL,         agree=0.667, adj=0.286, (0 split)
##       AMOUNT           < 1569.5 to the right, agree=0.633, adj=0.214, (0 split)
##       PRESENT_RESIDENT splits as  -LRR,       agree=0.633, adj=0.214, (0 split)
## 
## Node number 189: 16 observations
##   predicted class=1  expected loss=0.25  P(node) =0.03587444
##     class counts:     4    12
##    probabilities: 0.250 0.750 
## 
## Node number 376: 14 observations
##   predicted class=0  expected loss=0.2142857  P(node) =0.03139013
##     class counts:    11     3
##    probabilities: 0.786 0.214 
## 
## Node number 377: 16 observations
##   predicted class=1  expected loss=0.375  P(node) =0.03587444
##     class counts:     6    10
##    probabilities: 0.375 0.625

Then we plot the classification tree.

rpart.plot(german.ct)

We see that among the 31 explanatory variables, this model uses 6 of them: CHK_ACCT, AMOUNT, HISTORY, DURATION, GUARANTOR and EMPLOYMENT.

german.bigct.prediction <- predict(german.ct, 
                                   newdata=German_credit.te,
                                   type="class")

confusionMatrix(data=german.bigct.prediction , reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  58  70
##          1  19 103
##                                           
##                Accuracy : 0.644           
##                  95% CI : (0.5812, 0.7033)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.9552          
##                                           
##                   Kappa : 0.2945          
##                                           
##  Mcnemar's Test P-Value : 1.158e-07       
##                                           
##             Sensitivity : 0.7532          
##             Specificity : 0.5954          
##          Pos Pred Value : 0.4531          
##          Neg Pred Value : 0.8443          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2320          
##    Detection Prevalence : 0.5120          
##       Balanced Accuracy : 0.6743          
##                                           
##        'Positive' Class : 0               
## 

As the tree is quite big, we want to know whether we can prune it.

Pruning the tree

printcp(german.ct)
## 
## Classification tree:
## rpart(formula = RESPONSE ~ ., data = German_Credit.tr.subs, method = "class")
## 
## Variables actually used in tree construction:
## [1] AMOUNT     CHK_ACCT   DURATION   EMPLOYMENT GUARANTOR  HISTORY   
## 
## Root node error: 223/446 = 0.5
## 
## n= 446 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.399103      0   1.00000 1.08520 0.047179
## 2 0.022422      1   0.60090 0.63677 0.044117
## 3 0.014574      2   0.57848 0.66816 0.044668
## 4 0.011958      7   0.48430 0.69058 0.045028
## 5 0.011211     10   0.44843 0.67265 0.044742
## 6 0.010000     12   0.42601 0.67265 0.044742
plotcp(german.ct)

From the list of cp (complexity parameter) values, we choose the line with the lowest cross-validation error, shown in the column xerror. The best cp is therefore 0.022422, corresponding to a single split.

From the graph, we can identify that, according to the 1-SE rule, the trees of sizes 2 and 3 are equivalent, as both appear below the dotted line.

The value of cp can be chosen anywhere between 0.018 and 0.095, so we go with the cp of 0.022 suggested by the summary.
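The 1-SE selection can also be done programmatically from the cp table. A minimal sketch, using the xerror/xstd values from the printcp() output above (hardcoded here so the snippet is self-contained; in practice one would read `german.ct$cptable` directly):

```r
# cp table values copied from the printcp(german.ct) output above
cpt <- data.frame(
  CP     = c(0.399103, 0.022422, 0.014574, 0.011958, 0.011211, 0.010000),
  xerror = c(1.08520, 0.63677, 0.66816, 0.69058, 0.67265, 0.67265),
  xstd   = c(0.047179, 0.044117, 0.044668, 0.045028, 0.044742, 0.044742)
)

# 1-SE rule: simplest tree whose xerror is within one SE of the minimum
best      <- which.min(cpt$xerror)
threshold <- cpt$xerror[best] + cpt$xstd[best]
cp_1se    <- cpt$CP[min(which(cpt$xerror <= threshold))]
cp_1se  # 0.022422, matching the cp we chose by hand
```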

german.ct.prune <- prune(german.ct, cp=0.022)
rpart.plot(german.ct.prune)

This pruned decision tree with a cp of 0.022 uses the variables CHK_ACCT and HISTORY.

Using this hand-pruned tree, we can compute predictions and build a confusion matrix.

german.ct.prediction <- predict(german.ct.prune, newdata=German_credit.te, type="class")

# Confusion Matrix
confusionMatrix(data=german.ct.prediction, reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 63 95
##          1 14 78
##                                           
##                Accuracy : 0.564           
##                  95% CI : (0.5001, 0.6264)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2083          
##                                           
##  Mcnemar's Test P-Value : 1.822e-14       
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.4509          
##          Pos Pred Value : 0.3987          
##          Neg Pred Value : 0.8478          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2520          
##    Detection Prevalence : 0.6320          
##       Balanced Accuracy : 0.6345          
##                                           
##        'Positive' Class : 0               
## 

The accuracy of the classification tree is not very good (0.564); the balanced accuracy (0.6345) is a little better but still not satisfactory.
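As a sanity check, the balanced accuracy can be recomputed by hand from the confusion matrix counts (recall that the positive class here is 0, i.e. bad credit):

```r
# Counts from the confusion matrix above (positive class is 0)
TP <- 63; FN <- 14   # class-0 applicants correctly / wrongly classified
TN <- 78; FP <- 95   # class-1 applicants correctly / wrongly classified

sensitivity <- TP / (TP + FN)            # 63/77  ~ 0.8182
specificity <- TN / (TN + FP)            # 78/173 ~ 0.4509
balanced_accuracy <- (sensitivity + specificity) / 2
round(balanced_accuracy, 4)              # 0.6345
```

This matches the caret output and shows why the balanced accuracy sits above the raw accuracy: the model catches most bad-credit applicants but misclassifies over half of the good ones.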

We also look at what an automatic pruning using the 1-SE rule would give us, and compare the two models.

set.seed(1234)
german.ct.autoprune <- autoprune(RESPONSE ~ ., method = "class",
                                 data = German_Credit.tr.subs)
rpart.plot(german.ct.autoprune)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Here, only the variable CHK_ACCT is used.

german.ct.autoprune.prediction <- predict(german.ct.autoprune,
                                          newdata=German_credit.te,
                                          type="class")

# Confusion Matrix
confusionMatrix(data=german.ct.autoprune.prediction,
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61 88
##          1 16 85
##                                           
##                Accuracy : 0.584           
##                  95% CI : (0.5202, 0.6458)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.2251          
##                                           
##  Mcnemar's Test P-Value : 3.352e-12       
##                                           
##             Sensitivity : 0.7922          
##             Specificity : 0.4913          
##          Pos Pred Value : 0.4094          
##          Neg Pred Value : 0.8416          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2440          
##    Detection Prevalence : 0.5960          
##       Balanced Accuracy : 0.6418          
##                                           
##        'Positive' Class : 0               
## 

The accuracy of this classification tree is not very good either (0.584), although it is slightly better than the tree we pruned by hand. Its balanced accuracy (0.6418) is also slightly higher than both its own accuracy and the balanced accuracy of the other model, but still not satisfactory.

Variable importance of the classification tree

varImp(german.ct.prune)
##                    Overall
## AMOUNT           17.947179
## CHK_ACCT         37.098755
## DURATION         11.073258
## EMPLOYMENT        4.645266
## HISTORY          18.840050
## OTHER_INSTALL     3.216630
## RETRAINING        3.509915
## SAV_ACCT          9.538067
## NEW_CAR           0.000000
## USED_CAR          0.000000
## FURNITURE         0.000000
## RADIO.TV          0.000000
## EDUCATION         0.000000
## INSTALL_RATE      0.000000
## MALE_DIV          0.000000
## MALE_SINGLE       0.000000
## MALE_MAR_or_WID   0.000000
## CO.APPLICANT      0.000000
## GUARANTOR         0.000000
## PRESENT_RESIDENT  0.000000
## REAL_ESTATE       0.000000
## PROP_UNKN_NONE    0.000000
## AGE               0.000000
## RENT              0.000000
## OWN_RES           0.000000
## NUM_CREDITS       0.000000
## JOB               0.000000
## NUM_DEPENDENTS    0.000000
## TELEPHONE         0.000000
## FOREIGN           0.000000
vip(german.ct.prune)

From this plot, we see that the most influential variables are CHK_ACCT, HISTORY, SAV_ACCT, DURATION, AMOUNT and REAL_ESTATE. They are not exactly the same as the ones we saw above.

The variables CHK_ACCT and HISTORY show up in both, though.

Fitting another model: Logistic Regression

# Logistic regression to see the significant variables
logreg <- glm(RESPONSE~., data = German_Credit.tr.subs, family= binomial)
summary(logreg)
## 
## Call:
## glm(formula = RESPONSE ~ ., family = binomial, data = German_Credit.tr.subs)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.34578  -0.68043   0.00049   0.65178   2.74937  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.1911402  1.7958756   0.663 0.507161    
## CHK_ACCT1          0.5692882  0.3363406   1.693 0.090533 .  
## CHK_ACCT2          0.8404451  0.5339512   1.574 0.115485    
## CHK_ACCT3          2.4337691  0.3770606   6.455 1.09e-10 ***
## DURATION          -0.0123731  0.0142153  -0.870 0.384078    
## HISTORY1          -1.0734853  0.8514386  -1.261 0.207384    
## HISTORY2           0.0865599  0.6747882   0.128 0.897930    
## HISTORY3          -0.0598560  0.7410028  -0.081 0.935619    
## HISTORY4           1.1072483  0.6576414   1.684 0.092246 .  
## NEW_CAR1          -0.4538649  0.5853211  -0.775 0.438096    
## USED_CAR1          1.6322817  0.7540134   2.165 0.030404 *  
## FURNITURE1         0.0509645  0.6182782   0.082 0.934305    
## RADIO.TV1          0.5261147  0.5896893   0.892 0.372291    
## EDUCATION1         0.5441469  0.7499724   0.726 0.468111    
## RETRAINING1       -0.4293160  0.6787931  -0.632 0.527080    
## AMOUNT            -0.0002155  0.0000739  -2.916 0.003550 ** 
## SAV_ACCT1          0.6181742  0.4475399   1.381 0.167195    
## SAV_ACCT2         -0.2531524  0.5541205  -0.457 0.647776    
## SAV_ACCT3          0.7292579  0.6813687   1.070 0.284492    
## SAV_ACCT4          1.4221687  0.4243610   3.351 0.000804 ***
## EMPLOYMENT1        0.7574673  0.7956778   0.952 0.341108    
## EMPLOYMENT2        1.4785839  0.7640267   1.935 0.052959 .  
## EMPLOYMENT3        1.9691166  0.7947873   2.478 0.013229 *  
## EMPLOYMENT4        1.8560330  0.7511387   2.471 0.013475 *  
## INSTALL_RATE      -0.3367533  0.1411404  -2.386 0.017035 *  
## MALE_DIV1         -0.5653453  0.5705857  -0.991 0.321775    
## MALE_SINGLE1       0.1618525  0.3327207   0.486 0.626647    
## MALE_MAR_or_WID1  -0.5551862  0.5312986  -1.045 0.296041    
## CO.APPLICANT1     -0.6994379  0.6920599  -1.011 0.312179    
## GUARANTOR1         1.7126786  0.6556150   2.612 0.008993 ** 
## PRESENT_RESIDENT2 -1.1195205  0.4773294  -2.345 0.019008 *  
## PRESENT_RESIDENT3 -0.2590309  0.5313455  -0.487 0.625904    
## PRESENT_RESIDENT4 -0.9082582  0.4793144  -1.895 0.058104 .  
## REAL_ESTATE1      -0.0137202  0.3384983  -0.041 0.967669    
## PROP_UNKN_NONE1   -1.4578770  0.6505748  -2.241 0.025032 *  
## AGE                0.0167050  0.0141041   1.184 0.236255    
## OTHER_INSTALL1    -0.6758552  0.3404321  -1.985 0.047113 *  
## RENT1             -1.2066453  0.8244600  -1.464 0.143315    
## OWN_RES1          -0.4707135  0.7665544  -0.614 0.539173    
## NUM_CREDITS       -0.3634820  0.3011721  -1.207 0.227474    
## JOB1              -0.7402802  1.1619781  -0.637 0.524069    
## JOB2              -1.2142377  1.1317833  -1.073 0.283337    
## JOB3              -1.4358446  1.1604352  -1.237 0.215964    
## NUM_DEPENDENTS     0.1270172  0.3832474   0.331 0.740325    
## TELEPHONE1         0.6259633  0.3143236   1.991 0.046430 *  
## FOREIGN1           1.2496315  0.8543880   1.463 0.143576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 618.29  on 445  degrees of freedom
## Residual deviance: 390.97  on 400  degrees of freedom
## AIC: 482.97
## 
## Number of Fisher Scoring iterations: 5

We see that many variables are not statistically significant, so a model reduction is worth considering.

We can now predict on the test set using the logistic regression.

german.logrer.pred_prob <- predict(logreg, newdata = German_credit.te,
                                    type="response")

german.logrer.pred <- ifelse(german.logrer.pred_prob >= 0.5, 1, 0)

# Confusion Matrix
confusionMatrix(data=as.factor(german.logrer.pred),
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  49  46
##          1  28 127
##                                           
##                Accuracy : 0.704           
##                  95% CI : (0.6432, 0.7599)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.36896         
##                                           
##                   Kappa : 0.3479          
##                                           
##  Mcnemar's Test P-Value : 0.04813         
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.7341          
##          Pos Pred Value : 0.5158          
##          Neg Pred Value : 0.8194          
##              Prevalence : 0.3080          
##          Detection Rate : 0.1960          
##    Detection Prevalence : 0.3800          
##       Balanced Accuracy : 0.6852          
##                                           
##        'Positive' Class : 0               
## 

From the confusion matrix summary, this model is better than the classification tree: the accuracy is higher (0.704) and the balanced accuracy (0.6852) is also nearly acceptable.

Variable selection and interpretation with the step method (AIC criterion)

A stepwise variable selection based on the AIC is performed.
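For reference, step() compares models on the AIC. For a binary logistic model the residual deviance equals \(-2\log L\), so with the full model's residual deviance of 390.97 and its 46 estimated coefficients we recover the starting value of the trace:

\[ AIC = -2\log L + 2k = 390.97 + 2 \times 46 = 482.97 \]

At each step, the variable whose removal lowers the AIC the most is dropped, until no removal improves it.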

mod.logreg.sel <- step(logreg) # store the final model into mod.logreg.sel
## Start:  AIC=482.97
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     FURNITURE + RADIO.TV + EDUCATION + RETRAINING + AMOUNT + 
##     SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + MALE_SINGLE + 
##     MALE_MAR_or_WID + CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + 
##     REAL_ESTATE + PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + 
##     OWN_RES + NUM_CREDITS + JOB + NUM_DEPENDENTS + TELEPHONE + 
##     FOREIGN
## 
##                    Df Deviance    AIC
## - JOB               3   394.07 480.07
## - REAL_ESTATE       1   390.97 480.97
## - FURNITURE         1   390.98 480.98
## - NUM_DEPENDENTS    1   391.08 481.08
## - MALE_SINGLE       1   391.21 481.21
## - OWN_RES           1   391.35 481.35
## - RETRAINING        1   391.37 481.37
## - EDUCATION         1   391.50 481.50
## - NEW_CAR           1   391.57 481.57
## - DURATION          1   391.73 481.73
## - RADIO.TV          1   391.77 481.77
## - MALE_DIV          1   391.97 481.97
## - CO.APPLICANT      1   392.06 482.06
## - MALE_MAR_or_WID   1   392.07 482.07
## - AGE               1   392.39 482.39
## - NUM_CREDITS       1   392.46 482.46
## <none>                  390.97 482.97
## - RENT              1   393.15 483.15
## - FOREIGN           1   393.24 483.24
## - OTHER_INSTALL     1   395.01 485.01
## - TELEPHONE         1   395.01 485.01
## - PRESENT_RESIDENT  3   399.08 485.08
## - USED_CAR          1   395.81 485.81
## - PROP_UNKN_NONE    1   396.35 486.35
## - INSTALL_RATE      1   396.83 486.83
## - EMPLOYMENT        4   404.07 488.07
## - GUARANTOR         1   398.34 488.34
## - SAV_ACCT          4   404.98 488.98
## - HISTORY           4   405.97 489.97
## - AMOUNT            1   400.48 490.48
## - CHK_ACCT          3   443.96 529.96
## 
## Step:  AIC=480.07
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     FURNITURE + RADIO.TV + EDUCATION + RETRAINING + AMOUNT + 
##     SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + MALE_SINGLE + 
##     MALE_MAR_or_WID + CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + 
##     REAL_ESTATE + PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + 
##     OWN_RES + NUM_CREDITS + NUM_DEPENDENTS + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - FURNITURE         1   394.07 478.07
## - REAL_ESTATE       1   394.14 478.14
## - NUM_DEPENDENTS    1   394.36 478.36
## - EDUCATION         1   394.40 478.40
## - MALE_SINGLE       1   394.53 478.53
## - RETRAINING        1   394.54 478.54
## - RADIO.TV          1   394.64 478.64
## - OWN_RES           1   394.77 478.77
## - NEW_CAR           1   394.77 478.77
## - MALE_MAR_or_WID   1   395.00 479.00
## - DURATION          1   395.18 479.18
## - NUM_CREDITS       1   395.28 479.28
## - MALE_DIV          1   395.34 479.34
## - AGE               1   395.46 479.46
## - CO.APPLICANT      1   395.57 479.57
## <none>                  394.07 480.07
## - FOREIGN           1   396.86 480.86
## - TELEPHONE         1   396.96 480.96
## - RENT              1   397.22 481.22
## - PRESENT_RESIDENT  3   401.89 481.89
## - USED_CAR          1   397.93 481.93
## - OTHER_INSTALL     1   397.95 481.95
## - EMPLOYMENT        4   406.08 484.08
## - INSTALL_RATE      1   400.30 484.30
## - PROP_UNKN_NONE    1   400.37 484.37
## - GUARANTOR         1   401.43 485.43
## - SAV_ACCT          4   407.95 485.95
## - HISTORY           4   409.38 487.38
## - AMOUNT            1   403.88 487.88
## - CHK_ACCT          3   445.71 525.71
## 
## Step:  AIC=478.07
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + EDUCATION + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_SINGLE + MALE_MAR_or_WID + 
##     CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + REAL_ESTATE + 
##     PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + OWN_RES + NUM_CREDITS + 
##     NUM_DEPENDENTS + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - REAL_ESTATE       1   394.14 476.14
## - NUM_DEPENDENTS    1   394.36 476.36
## - MALE_SINGLE       1   394.53 476.53
## - EDUCATION         1   394.57 476.57
## - OWN_RES           1   394.77 476.77
## - RETRAINING        1   394.87 476.87
## - MALE_MAR_or_WID   1   395.02 477.02
## - DURATION          1   395.18 477.18
## - NUM_CREDITS       1   395.29 477.29
## - RADIO.TV          1   395.29 477.29
## - MALE_DIV          1   395.34 477.34
## - AGE               1   395.46 477.46
## - CO.APPLICANT      1   395.57 477.57
## - NEW_CAR           1   395.72 477.72
## <none>                  394.07 478.07
## - FOREIGN           1   396.86 478.86
## - TELEPHONE         1   396.96 478.96
## - RENT              1   397.22 479.22
## - PRESENT_RESIDENT  3   401.92 479.92
## - OTHER_INSTALL     1   397.95 479.95
## - USED_CAR          1   399.89 481.89
## - EMPLOYMENT        4   406.08 482.08
## - PROP_UNKN_NONE    1   400.37 482.37
## - INSTALL_RATE      1   400.39 482.39
## - GUARANTOR         1   401.44 483.44
## - SAV_ACCT          4   407.96 483.96
## - HISTORY           4   409.67 485.67
## - AMOUNT            1   403.93 485.93
## - CHK_ACCT          3   446.04 524.04
## 
## Step:  AIC=476.14
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + EDUCATION + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_SINGLE + MALE_MAR_or_WID + 
##     CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + 
##     AGE + OTHER_INSTALL + RENT + OWN_RES + NUM_CREDITS + NUM_DEPENDENTS + 
##     TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - NUM_DEPENDENTS    1   394.44 474.44
## - MALE_SINGLE       1   394.60 474.60
## - EDUCATION         1   394.62 474.62
## - OWN_RES           1   394.83 474.83
## - RETRAINING        1   394.90 474.90
## - MALE_MAR_or_WID   1   395.04 475.04
## - DURATION          1   395.25 475.25
## - NUM_CREDITS       1   395.34 475.34
## - MALE_DIV          1   395.36 475.36
## - RADIO.TV          1   395.43 475.43
## - AGE               1   395.54 475.54
## - CO.APPLICANT      1   395.62 475.62
## - NEW_CAR           1   395.75 475.75
## <none>                  394.14 476.14
## - FOREIGN           1   396.89 476.89
## - TELEPHONE         1   396.96 476.96
## - RENT              1   397.28 477.28
## - PRESENT_RESIDENT  3   402.07 478.07
## - OTHER_INSTALL     1   398.14 478.14
## - USED_CAR          1   400.04 480.04
## - EMPLOYMENT        4   406.12 480.12
## - PROP_UNKN_NONE    1   400.61 480.61
## - INSTALL_RATE      1   400.63 480.63
## - GUARANTOR         1   401.92 481.92
## - SAV_ACCT          4   408.08 482.08
## - HISTORY           4   409.69 483.69
## - AMOUNT            1   404.35 484.35
## - CHK_ACCT          3   447.81 523.81
## 
## Step:  AIC=474.44
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + EDUCATION + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_SINGLE + MALE_MAR_or_WID + 
##     CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + 
##     AGE + OTHER_INSTALL + RENT + OWN_RES + NUM_CREDITS + TELEPHONE + 
##     FOREIGN
## 
##                    Df Deviance    AIC
## - EDUCATION         1   394.92 472.92
## - MALE_SINGLE       1   395.09 473.09
## - OWN_RES           1   395.17 473.17
## - RETRAINING        1   395.17 473.17
## - MALE_MAR_or_WID   1   395.38 473.38
## - NUM_CREDITS       1   395.62 473.62
## - DURATION          1   395.62 473.62
## - MALE_DIV          1   395.65 473.65
## - RADIO.TV          1   395.72 473.72
## - AGE               1   395.86 473.86
## - CO.APPLICANT      1   395.92 473.92
## - NEW_CAR           1   395.93 473.93
## <none>                  394.44 474.44
## - FOREIGN           1   397.19 475.19
## - TELEPHONE         1   397.23 475.23
## - RENT              1   397.64 475.64
## - OTHER_INSTALL     1   398.27 476.27
## - PRESENT_RESIDENT  3   402.33 476.33
## - USED_CAR          1   400.66 478.66
## - EMPLOYMENT        4   406.68 478.68
## - PROP_UNKN_NONE    1   400.76 478.76
## - INSTALL_RATE      1   401.31 479.31
## - GUARANTOR         1   402.15 480.15
## - SAV_ACCT          4   408.32 480.32
## - HISTORY           4   409.71 481.71
## - AMOUNT            1   404.85 482.85
## - CHK_ACCT          3   448.11 522.11
## 
## Step:  AIC=472.92
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_SINGLE + MALE_MAR_or_WID + 
##     CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + 
##     AGE + OTHER_INSTALL + RENT + OWN_RES + NUM_CREDITS + TELEPHONE + 
##     FOREIGN
## 
##                    Df Deviance    AIC
## - MALE_SINGLE       1   395.58 471.58
## - OWN_RES           1   395.70 471.70
## - MALE_MAR_or_WID   1   395.82 471.82
## - RADIO.TV          1   395.83 471.83
## - DURATION          1   396.03 472.03
## - RETRAINING        1   396.05 472.05
## - NUM_CREDITS       1   396.12 472.12
## - MALE_DIV          1   396.16 472.16
## - AGE               1   396.46 472.46
## - CO.APPLICANT      1   396.54 472.54
## <none>                  394.92 472.92
## - NEW_CAR           1   397.36 473.36
## - FOREIGN           1   397.66 473.66
## - TELEPHONE         1   397.84 473.84
## - RENT              1   398.20 474.20
## - OTHER_INSTALL     1   398.71 474.71
## - PRESENT_RESIDENT  3   402.81 474.81
## - USED_CAR          1   400.66 476.66
## - PROP_UNKN_NONE    1   400.97 476.97
## - EMPLOYMENT        4   407.23 477.23
## - INSTALL_RATE      1   401.81 477.81
## - GUARANTOR         1   402.42 478.42
## - SAV_ACCT          4   409.38 479.38
## - HISTORY           4   410.48 480.48
## - AMOUNT            1   405.60 481.60
## - CHK_ACCT          3   448.27 520.27
## 
## Step:  AIC=471.58
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_MAR_or_WID + CO.APPLICANT + 
##     GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + AGE + OTHER_INSTALL + 
##     RENT + OWN_RES + NUM_CREDITS + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - OWN_RES           1   396.32 470.32
## - RADIO.TV          1   396.61 470.61
## - RETRAINING        1   396.63 470.63
## - DURATION          1   396.64 470.64
## - NUM_CREDITS       1   396.75 470.75
## - CO.APPLICANT      1   397.14 471.14
## - MALE_MAR_or_WID   1   397.23 471.23
## - AGE               1   397.36 471.36
## <none>                  395.58 471.58
## - MALE_DIV          1   397.73 471.73
## - NEW_CAR           1   397.94 471.94
## - FOREIGN           1   398.40 472.40
## - TELEPHONE         1   398.56 472.56
## - RENT              1   398.98 472.98
## - OTHER_INSTALL     1   399.33 473.33
## - PRESENT_RESIDENT  3   403.51 473.51
## - USED_CAR          1   401.28 475.28
## - PROP_UNKN_NONE    1   401.47 475.47
## - INSTALL_RATE      1   401.91 475.91
## - GUARANTOR         1   403.51 477.51
## - EMPLOYMENT        4   409.53 477.53
## - SAV_ACCT          4   409.88 477.88
## - HISTORY           4   411.06 479.06
## - AMOUNT            1   405.84 479.84
## - CHK_ACCT          3   448.88 518.88
## 
## Step:  AIC=470.32
## RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + 
##     RADIO.TV + RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + 
##     INSTALL_RATE + MALE_DIV + MALE_MAR_or_WID + CO.APPLICANT + 
##     GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + AGE + OTHER_INSTALL + 
##     RENT + NUM_CREDITS + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - DURATION          1   397.20 469.20
## - RADIO.TV          1   397.29 469.29
## - NUM_CREDITS       1   397.39 469.39
## - RETRAINING        1   397.39 469.39
## - MALE_MAR_or_WID   1   397.86 469.86
## - CO.APPLICANT      1   397.91 469.91
## <none>                  396.32 470.32
## - AGE               1   398.39 470.39
## - MALE_DIV          1   398.40 470.40
## - NEW_CAR           1   398.57 470.57
## - FOREIGN           1   398.92 470.92
## - TELEPHONE         1   399.22 471.22
## - OTHER_INSTALL     1   400.14 472.14
## - PRESENT_RESIDENT  3   404.24 472.24
## - RENT              1   402.09 474.09
## - USED_CAR          1   402.59 474.59
## - INSTALL_RATE      1   402.66 474.66
## - PROP_UNKN_NONE    1   403.68 475.68
## - EMPLOYMENT        4   410.14 476.14
## - SAV_ACCT          4   410.22 476.22
## - GUARANTOR         1   404.24 476.24
## - HISTORY           4   411.37 477.37
## - AMOUNT            1   406.67 478.67
## - CHK_ACCT          3   449.30 517.30
## 
## Step:  AIC=469.2
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RADIO.TV + 
##     RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + 
##     MALE_DIV + MALE_MAR_or_WID + CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + 
##     PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + NUM_CREDITS + 
##     TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - RADIO.TV          1   398.00 468.00
## - NUM_CREDITS       1   398.24 468.24
## - CO.APPLICANT      1   398.68 468.68
## - RETRAINING        1   398.75 468.75
## - MALE_MAR_or_WID   1   398.82 468.82
## <none>                  397.20 469.20
## - MALE_DIV          1   399.36 469.36
## - NEW_CAR           1   399.59 469.59
## - AGE               1   399.75 469.75
## - FOREIGN           1   400.30 470.30
## - TELEPHONE         1   400.54 470.54
## - OTHER_INSTALL     1   401.18 471.18
## - PRESENT_RESIDENT  3   405.32 471.32
## - RENT              1   402.71 472.71
## - USED_CAR          1   403.35 473.35
## - EMPLOYMENT        4   410.61 474.61
## - SAV_ACCT          4   410.86 474.86
## - GUARANTOR         1   404.97 474.97
## - PROP_UNKN_NONE    1   405.48 475.48
## - INSTALL_RATE      1   405.49 475.49
## - HISTORY           4   412.02 476.02
## - AMOUNT            1   418.20 488.20
## - CHK_ACCT          3   452.31 518.31
## 
## Step:  AIC=468
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RETRAINING + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + 
##     MALE_MAR_or_WID + CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + 
##     PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + NUM_CREDITS + 
##     TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - NUM_CREDITS       1   398.89 466.89
## - MALE_MAR_or_WID   1   399.37 467.37
## - CO.APPLICANT      1   399.50 467.50
## <none>                  398.00 468.00
## - MALE_DIV          1   400.33 468.33
## - AGE               1   400.43 468.43
## - RETRAINING        1   400.66 468.66
## - FOREIGN           1   401.11 469.11
## - TELEPHONE         1   401.60 469.60
## - OTHER_INSTALL     1   401.93 469.93
## - PRESENT_RESIDENT  3   406.08 470.08
## - NEW_CAR           1   403.02 471.02
## - USED_CAR          1   403.47 471.47
## - RENT              1   403.84 471.84
## - SAV_ACCT          4   411.51 473.51
## - INSTALL_RATE      1   405.70 473.70
## - EMPLOYMENT        4   411.78 473.78
## - GUARANTOR         1   406.17 474.17
## - HISTORY           4   412.43 474.43
## - PROP_UNKN_NONE    1   406.54 474.54
## - AMOUNT            1   419.79 487.79
## - CHK_ACCT          3   453.56 517.56
## 
## Step:  AIC=466.89
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RETRAINING + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + 
##     MALE_MAR_or_WID + CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + 
##     PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT + TELEPHONE + 
##     FOREIGN
## 
##                    Df Deviance    AIC
## - MALE_MAR_or_WID   1   400.21 466.21
## - CO.APPLICANT      1   400.43 466.43
## <none>                  398.89 466.89
## - MALE_DIV          1   401.09 467.09
## - AGE               1   401.28 467.28
## - RETRAINING        1   401.54 467.54
## - FOREIGN           1   402.04 468.04
## - TELEPHONE         1   402.43 468.43
## - OTHER_INSTALL     1   402.95 468.95
## - PRESENT_RESIDENT  3   407.00 469.00
## - NEW_CAR           1   404.15 470.15
## - USED_CAR          1   404.45 470.45
## - RENT              1   404.91 470.91
## - SAV_ACCT          4   412.22 472.22
## - INSTALL_RATE      1   406.47 472.47
## - HISTORY           4   412.50 472.50
## - EMPLOYMENT        4   412.60 472.60
## - GUARANTOR         1   407.16 473.16
## - PROP_UNKN_NONE    1   407.24 473.24
## - AMOUNT            1   421.00 487.00
## - CHK_ACCT          3   454.25 516.25
## 
## Step:  AIC=466.21
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RETRAINING + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + 
##     CO.APPLICANT + GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + 
##     AGE + OTHER_INSTALL + RENT + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - CO.APPLICANT      1   401.67 465.67
## - MALE_DIV          1   402.12 466.12
## <none>                  400.21 466.21
## - AGE               1   402.75 466.75
## - FOREIGN           1   402.78 466.78
## - RETRAINING        1   402.90 466.90
## - TELEPHONE         1   403.60 467.60
## - OTHER_INSTALL     1   404.12 468.12
## - PRESENT_RESIDENT  3   408.28 468.28
## - NEW_CAR           1   405.42 469.42
## - USED_CAR          1   405.66 469.66
## - RENT              1   406.25 470.25
## - SAV_ACCT          4   412.79 470.79
## - HISTORY           4   413.36 471.36
## - INSTALL_RATE      1   408.02 472.02
## - GUARANTOR         1   408.12 472.12
## - PROP_UNKN_NONE    1   408.46 472.46
## - EMPLOYMENT        4   414.61 472.61
## - AMOUNT            1   421.27 485.27
## - CHK_ACCT          3   455.55 515.55
## 
## Step:  AIC=465.67
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RETRAINING + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + 
##     GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + AGE + OTHER_INSTALL + 
##     RENT + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## - MALE_DIV          1   403.37 465.37
## <none>                  401.67 465.67
## - AGE               1   404.05 466.05
## - RETRAINING        1   404.28 466.28
## - FOREIGN           1   404.40 466.40
## - TELEPHONE         1   405.19 467.19
## - PRESENT_RESIDENT  3   409.37 467.37
## - OTHER_INSTALL     1   405.66 467.66
## - NEW_CAR           1   406.70 468.70
## - USED_CAR          1   407.58 469.58
## - RENT              1   408.22 470.22
## - HISTORY           4   414.55 470.55
## - SAV_ACCT          4   414.84 470.84
## - INSTALL_RATE      1   409.33 471.33
## - PROP_UNKN_NONE    1   409.84 471.84
## - GUARANTOR         1   410.10 472.10
## - EMPLOYMENT        4   416.15 472.15
## - AMOUNT            1   424.34 486.34
## - CHK_ACCT          3   458.18 516.18
## 
## Step:  AIC=465.37
## RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + RETRAINING + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + GUARANTOR + 
##     PRESENT_RESIDENT + PROP_UNKN_NONE + AGE + OTHER_INSTALL + 
##     RENT + TELEPHONE + FOREIGN
## 
##                    Df Deviance    AIC
## <none>                  403.37 465.37
## - AGE               1   405.39 465.39
## - FOREIGN           1   406.08 466.08
## - PRESENT_RESIDENT  3   410.70 466.70
## - RETRAINING        1   406.78 466.78
## - TELEPHONE         1   406.78 466.78
## - OTHER_INSTALL     1   407.15 467.15
## - NEW_CAR           1   408.67 468.67
## - USED_CAR          1   409.53 469.53
## - HISTORY           4   415.74 469.74
## - RENT              1   410.11 470.11
## - SAV_ACCT          4   416.43 470.43
## - INSTALL_RATE      1   410.48 470.48
## - PROP_UNKN_NONE    1   411.18 471.18
## - EMPLOYMENT        4   417.60 471.60
## - GUARANTOR         1   411.85 471.85
## - AMOUNT            1   426.51 486.51
## - CHK_ACCT          3   460.85 516.85
summary(mod.logreg.sel)
## 
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + HISTORY + NEW_CAR + USED_CAR + 
##     RETRAINING + AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + 
##     GUARANTOR + PRESENT_RESIDENT + PROP_UNKN_NONE + AGE + OTHER_INSTALL + 
##     RENT + TELEPHONE + FOREIGN, family = binomial, data = German_Credit.tr.subs)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.39343  -0.68768  -0.02628   0.71315   2.60726  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.6339113  1.1570626  -0.548 0.583786    
## CHK_ACCT1          0.5970566  0.3291383   1.814 0.069678 .  
## CHK_ACCT2          1.1123874  0.5042812   2.206 0.027392 *  
## CHK_ACCT3          2.4109175  0.3597629   6.701 2.06e-11 ***
## HISTORY1          -0.6393459  0.8007632  -0.798 0.424626    
## HISTORY2           0.3153810  0.6295178   0.501 0.616379    
## HISTORY3           0.0245812  0.7411154   0.033 0.973541    
## HISTORY4           1.0638624  0.6476870   1.643 0.100475    
## NEW_CAR1          -0.7159178  0.3141611  -2.279 0.022678 *  
## USED_CAR1          1.3489217  0.5579072   2.418 0.015614 *  
## RETRAINING1       -0.8489518  0.4619050  -1.838 0.066072 .  
## AMOUNT            -0.0002631  0.0000595  -4.421 9.81e-06 ***
## SAV_ACCT1          0.5675624  0.4173990   1.360 0.173906    
## SAV_ACCT2         -0.0693577  0.5399345  -0.128 0.897788    
## SAV_ACCT3          0.5603771  0.6437202   0.871 0.384011    
## SAV_ACCT4          1.3592948  0.4069584   3.340 0.000837 ***
## EMPLOYMENT1        0.5542570  0.6967423   0.795 0.426324    
## EMPLOYMENT2        1.2338686  0.6524020   1.891 0.058588 .  
## EMPLOYMENT3        1.7999683  0.6887566   2.613 0.008966 ** 
## EMPLOYMENT4        1.5521376  0.6518729   2.381 0.017264 *  
## INSTALL_RATE      -0.3278020  0.1249721  -2.623 0.008716 ** 
## GUARANTOR1         1.6927223  0.6068573   2.789 0.005282 ** 
## PRESENT_RESIDENT2 -1.1117822  0.4641005  -2.396 0.016595 *  
## PRESENT_RESIDENT3 -0.3408387  0.5041109  -0.676 0.498966    
## PRESENT_RESIDENT4 -0.7613632  0.4531619  -1.680 0.092935 .  
## PROP_UNKN_NONE1   -1.0532655  0.3848454  -2.737 0.006203 ** 
## AGE                0.0181856  0.0128738   1.413 0.157769    
## OTHER_INSTALL1    -0.6281982  0.3256821  -1.929 0.053747 .  
## RENT1             -0.8736712  0.3412119  -2.560 0.010452 *  
## TELEPHONE1         0.5251823  0.2863172   1.834 0.066614 .  
## FOREIGN1           1.2896516  0.8049248   1.602 0.109111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 618.29  on 445  degrees of freedom
## Residual deviance: 403.37  on 415  degrees of freedom
## AIC: 465.37
## 
## Number of Fisher Scoring iterations: 5

The variables removed by the selection are: DURATION, FURNITURE, RADIO.TV, EDUCATION, MALE_DIV, MALE_SINGLE, MALE_MAR_or_WID, CO.APPLICANT, REAL_ESTATE, OWN_RES, NUM_CREDITS, JOB and NUM_DEPENDENTS.

In the end, we keep the model with the most significant variables. Writing \(\eta\) for the linear predictor (the log-odds of a Good credit rating), the fitted model is:

\[ \eta = -0.6339113 + 0.5970566 * CHK_{ACCT1} + 1.1123874 * CHK_{ACCT2} + 2.4109175 * CHK_{ACCT3} - 0.6393459 * HISTORY1 + 0.3153810 * HISTORY2 \\ + 0.0245812 * HISTORY3 + 1.0638624 * HISTORY4 - 0.7159178 * NEW_{CAR1} \\ + 1.3489217 * USED_{CAR1} - 0.8489518 * RETRAINING1 - 0.0002631 * AMOUNT \\ + 0.5675624 * SAV_{ACCT1} - 0.0693577 * SAV_{ACCT2} + 0.5603771 * SAV_{ACCT3} \\ + 1.3592948 * SAV_{ACCT4} + 0.5542570 * EMPLOYMENT1 + 1.2338686 * EMPLOYMENT2 \\ + 1.7999683 * EMPLOYMENT3 + 1.5521376 * EMPLOYMENT4 - 0.3278020 * INSTALL_{RATE} \\ + 1.6927223 * GUARANTOR1 - 1.1117822 * PRESENT_{RESIDENT2} \\ - 0.3408387 * PRESENT_{RESIDENT3} - 0.7613632 * PRESENT_{RESIDENT4} \\ - 1.0532655 * PROP\_UNKN\_NONE1 + 0.0181856 * AGE - 0.6281982 * OTHER_{INSTALL1} \\ - 0.8736712 * RENT1 + 0.5251823 * TELEPHONE1 + 1.2896516 * FOREIGN1 \]

and the predicted probability of a Good credit rating is

\[ p = P(RESPONSE = 1) = \frac{e^{\eta}}{1 + e^{\eta}} \]
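As a quick numeric check of this logistic link, a log-odds value can be converted to a probability with R's plogis (a sketch; the value 1.2 is an arbitrary illustrative log-odds, not taken from the model):

```r
# plogis(x) = exp(x) / (1 + exp(x)); e.g. for a log-odds of 1.2:
plogis(1.2)  # ≈ 0.77, i.e. a 77% predicted probability of Good credit
```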

It means that:

  • The predicted probability of being a good applicant for CHK_ACCT3 is higher than for CHK_ACCT0 (and also higher than for CHK_ACCT1 and CHK_ACCT2).
  • The predicted probability of being a good applicant for HISTORY1 is lower than for HISTORY0.
  • The predicted probability of being a good applicant for HISTORY4 is higher than for HISTORY0 (and also higher than for HISTORY2 and HISTORY3).
  • The predicted probability of being a good applicant for NEW_CAR1 is lower than for NEW_CAR0.
  • The predicted probability of being a good applicant for USED_CAR1 is higher than for USED_CAR0.
  • The predicted probability of being a good applicant for RETRAINING1 is lower than for RETRAINING0.
  • AMOUNT is negatively associated with RESPONSE.
  • The predicted probability of being a good applicant for SAV_ACCT4 is higher than for SAV_ACCT0 (and also higher than for SAV_ACCT1 and SAV_ACCT3).
  • The predicted probability of being a good applicant for SAV_ACCT2 is lower than for SAV_ACCT0.
  • The predicted probability of being a good applicant for EMPLOYMENT3 is higher than for EMPLOYMENT0 (and also higher than for EMPLOYMENT1, EMPLOYMENT2 and EMPLOYMENT4).
  • INSTALL_RATE is negatively associated with RESPONSE.
  • The predicted probability of being a good applicant for GUARANTOR1 is higher than for GUARANTOR0.
  • The predicted probability of being a good applicant for PRESENT_RESIDENT2 is lower than for PRESENT_RESIDENT0 (and also lower than for PRESENT_RESIDENT3 and PRESENT_RESIDENT4).
  • The predicted probability of being a good applicant for PROP_UNKN_NONE1 is lower than for PROP_UNKN_NONE0.
  • AGE is positively associated with RESPONSE.
  • The predicted probability of being a good applicant for OTHER_INSTALL1 is lower than for OTHER_INSTALL0.
  • The predicted probability of being a good applicant for RENT1 is lower than for RENT0.
  • The predicted probability of being a good applicant for TELEPHONE1 is higher than for TELEPHONE0.
  • The predicted probability of being a good applicant for FOREIGN1 is higher than for FOREIGN0.
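The direction and strength of these effects can also be read from the exponentiated coefficients, which are odds ratios (a sketch, assuming mod.logreg.sel is the fitted model above):

```r
# exp(beta) > 1 increases the odds of Good credit, < 1 decreases them
odds_ratios <- exp(coef(mod.logreg.sel))
round(sort(odds_ratios, decreasing = TRUE), 3)
```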
german.logregsel.pred_prob <- predict(mod.logreg.sel, newdata = German_credit.te,
                                    type="response")

german.logregsel.pred <- ifelse(german.logregsel.pred_prob  >= 0.5, 1, 0)

# Confusion Matrix
confusionMatrix(data=as.factor(german.logregsel.pred),
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  50  52
##          1  27 121
##                                           
##                Accuracy : 0.684           
##                  95% CI : (0.6224, 0.7411)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.63688         
##                                           
##                   Kappa : 0.32            
##                                           
##  Mcnemar's Test P-Value : 0.00693         
##                                           
##             Sensitivity : 0.6494          
##             Specificity : 0.6994          
##          Pos Pred Value : 0.4902          
##          Neg Pred Value : 0.8176          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.4080          
##       Balanced Accuracy : 0.6744          
##                                           
##        'Positive' Class : 0               
## 

Variable importance for logistic regression

x_train <- select(German_Credit.tr.subs, -RESPONSE)
y_train <- pull(German_Credit.tr.subs, RESPONSE)

explainer_logreg <- explain(model = mod.logreg.sel, 
                               data = x_train, 
                               y = as.numeric(y_train), 
                               label = "Logistic Regression")
## Preparation of a new explainer is initiated
##   -> model label       :  Logistic Regression 
##   -> data              :  446  rows  30  cols 
##   -> target variable   :  446  values 
##   -> predict function  :  yhat.glm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.1.3 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.007541882 , mean =  0.5 , max =  0.9975191  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.05702613 , mean =  1 , max =  1.96659  
##   A new explainer has been created!
importance_logreg  <- calculate_importance(explainer_logreg)
plot(importance_logreg)

vip(mod.logreg.sel)

The plots above show the most important variables for the reduced logistic regression model.

Fitting another model: KNN

To perform the k-nearest neighbor method, we do not need to balance the data, so we use the unbalanced training set. We start by making predictions with a 2-NN (Euclidean distance).

German_credit.num.response <- German_credit_cleaned %>% 
  select('RESPONSE','DURATION', 'AMOUNT', 
         'INSTALL_RATE', 'AGE', 'NUM_CREDITS','NUM_DEPENDENTS')
set.seed(123)
## build the K-NN model with k = 2 
German_credit.knn <- knn3(data=German_credit.num.response, RESPONSE~., k=2) 

German_credit.knn.pred <- predict(German_credit.knn, 
                                  newdata = German_credit.num.response,
                                  type="class")

Now we can use the 2-NN to predict the test set from the training set. Note that the model is fitted on the training set and the predictions are computed on the test set.

set.seed(123)
# applying Knn model with k = 2 on the training set
German_credit.knn.tr <- knn3(data=German_credit.tr, RESPONSE~., k=2)  

German_credit.knn.tr.pred <- predict(German_credit.knn.tr, 
                                     newdata = German_credit.te, type="class")

To compare the predictions above with the true status of the applicants in the test set, we build a table called a confusion matrix (this will be detailed later on).

The table is read as follows:

  • 21 applicants were correctly predicted as Bad credit, but 45 Good credits were wrongly predicted as Bad.
  • 128 applicants were correctly predicted as Good credit, but 56 Bad credits were wrongly predicted as Good.
# Confusion Matrix
confusionMatrix(data=German_credit.knn.tr.pred,
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  21  45
##          1  56 128
##                                           
##                Accuracy : 0.596           
##                  95% CI : (0.5323, 0.6574)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.9995          
##                                           
##                   Kappa : 0.0131          
##                                           
##  Mcnemar's Test P-Value : 0.3197          
##                                           
##             Sensitivity : 0.2727          
##             Specificity : 0.7399          
##          Pos Pred Value : 0.3182          
##          Neg Pred Value : 0.6957          
##              Prevalence : 0.3080          
##          Detection Rate : 0.0840          
##    Detection Prevalence : 0.2640          
##       Balanced Accuracy : 0.5063          
##                                           
##        'Positive' Class : 0               
## 

The accuracy (0.596) and the balanced accuracy (0.5063) are both too low.

The prediction is far from perfect, so we try to improve it by changing K. We now use K = 3.

set.seed(123)
# applying Knn model with k = 3 on the training set
German_credit.knn3.tr <- knn3(data=German_credit.tr, RESPONSE~., k=3)  

German_credit.knn3.tr.pred <- predict(German_credit.knn3.tr, 
                                      newdata = German_credit.te, type="class")

The table is read as follows:

  • 14 applicants were correctly predicted as Bad credit, but 28 Good credits were wrongly predicted as Bad.
  • 145 applicants were correctly predicted as Good credit, but 63 Bad credits were wrongly predicted as Good.
# Confusion Matrix
confusionMatrix(data=German_credit.knn3.tr.pred,
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  14  28
##          1  63 145
##                                          
##                Accuracy : 0.636          
##                  95% CI : (0.573, 0.6957)
##     No Information Rate : 0.692          
##     P-Value [Acc > NIR] : 0.975213       
##                                          
##                   Kappa : 0.0229         
##                                          
##  Mcnemar's Test P-Value : 0.000365       
##                                          
##             Sensitivity : 0.1818         
##             Specificity : 0.8382         
##          Pos Pred Value : 0.3333         
##          Neg Pred Value : 0.6971         
##              Prevalence : 0.3080         
##          Detection Rate : 0.0560         
##    Detection Prevalence : 0.1680         
##       Balanced Accuracy : 0.5100         
##                                          
##        'Positive' Class : 0              
## 

Both the accuracy (0.636) and the balanced accuracy (0.5100) improved a little with K = 3 compared to K = 2, but they remain low.
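Instead of trying values of K one by one, we could scan a range of K and keep the one with the best test accuracy (a sketch using the same split as above; in practice K should rather be selected by cross-validation on the training set, to avoid tuning on the test set):

```r
set.seed(123)
# Fit a k-NN for each K and record the test-set accuracy
accuracies <- sapply(1:15, function(k) {
  fit <- knn3(RESPONSE ~ ., data = German_credit.tr, k = k)
  pred <- predict(fit, newdata = German_credit.te, type = "class")
  mean(pred == German_credit.te$RESPONSE)
})
which.max(accuracies)  # the K giving the highest test accuracy
```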

Linear Support Vector Machine

German_credit.svm <- svm(RESPONSE ~ ., data=German_Credit.tr.subs, kernel="linear")
German_credit.svm
## 
## Call:
## svm(formula = RESPONSE ~ ., data = German_Credit.tr.subs, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  246
German_credit.svm.pred <- predict(German_credit.svm, newdata = German_credit.te)
confusionMatrix(data=German_credit.svm.pred, reference = German_credit.te$RESPONSE )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  50  50
##          1  27 123
##                                           
##                Accuracy : 0.692           
##                  95% CI : (0.6307, 0.7486)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.53077         
##                                           
##                   Kappa : 0.3328          
##                                           
##  Mcnemar's Test P-Value : 0.01217         
##                                           
##             Sensitivity : 0.6494          
##             Specificity : 0.7110          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.8200          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.6802          
##                                           
##        'Positive' Class : 0               
## 

The accuracy (0.692) and the balanced accuracy (0.6802) are below 0.75, which is probably not good enough.

Radial basis Support Vector Machine

We now try a radial basis kernel (the default).

German_credit.rb <- svm(RESPONSE ~ ., data=German_Credit.tr.subs, kernel="radial")
German_credit.rb
## 
## Call:
## svm(formula = RESPONSE ~ ., data = German_Credit.tr.subs, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  334
German_credit.pred <- predict(German_credit.rb, newdata = German_credit.te)
confusionMatrix(data=German_credit.pred, reference = German_credit.te$RESPONSE )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  54  52
##          1  23 121
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6391, 0.7561)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.421916        
##                                           
##                   Kappa : 0.3628          
##                                           
##  Mcnemar's Test P-Value : 0.001224        
##                                           
##             Sensitivity : 0.7013          
##             Specificity : 0.6994          
##          Pos Pred Value : 0.5094          
##          Neg Pred Value : 0.8403          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2160          
##    Detection Prevalence : 0.4240          
##       Balanced Accuracy : 0.7004          
##                                           
##        'Positive' Class : 0               
## 

The accuracy is slightly better: 70%, compared to 69.2% with the linear kernel.

Tuning the hyperparameters of the linear SVM

We want to select good hyperparameters for our linear SVM.

German_Credit.trctrl <- trainControl(method = "cv", number=10)
set.seed(143)
svm_Linear <- train(RESPONSE ~., data = German_Credit.tr.subs, method = "svmLinear",
                    trControl=German_Credit.trctrl)
svm_Linear
## Support Vector Machines with Linear Kernel 
## 
## 446 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 402, 400, 402, 401, 402, 402, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7264361  0.4530209
## 
## Tuning parameter 'C' was held constant at a value of 1

We see that the cross-validated accuracy is good (about 0.73).

grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))
grid
##       C
## 1 1e-02
## 2 1e-01
## 3 1e+00
## 4 1e+01
## 5 1e+02
## 6 1e+03
set.seed(143)
svm_Linear_Grid <- train(RESPONSE ~., data = German_Credit.tr.subs,
                         method = "svmLinear",
                         trControl=German_Credit.trctrl,
                         tuneGrid = grid)
svm_Linear_Grid
## Support Vector Machines with Linear Kernel 
## 
## 446 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 402, 400, 402, 401, 402, 402, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##   1e-02  0.7264339  0.4532044
##   1e-01  0.7286056  0.4575791
##   1e+00  0.7264361  0.4530209
##   1e+01  0.7108278  0.4216992
##   1e+02  0.7130501  0.4261940
##   1e+03  0.7197672  0.4397558
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.1.
plot(svm_Linear_Grid)

svm_Linear_Grid$bestTune
##     C
## 2 0.1
German_credit.lsvm.tuned <- svm(RESPONSE ~ .,data = German_Credit.tr.subs,
                              kernel = "linear", 
                              cost = svm_Linear_Grid$bestTune$C)

German_credit.lsvm.tuned.pred <- predict(German_credit.lsvm.tuned, 
                                       newdata = German_credit.te)

confusionMatrix(data=German_credit.lsvm.tuned.pred, 
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  49  49
##          1  28 124
##                                           
##                Accuracy : 0.692           
##                  95% CI : (0.6307, 0.7486)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.53077         
##                                           
##                   Kappa : 0.3283          
##                                           
##  Mcnemar's Test P-Value : 0.02265         
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.7168          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.8158          
##              Prevalence : 0.3080          
##          Detection Rate : 0.1960          
##    Detection Prevalence : 0.3920          
##       Balanced Accuracy : 0.6766          
##                                           
##        'Positive' Class : 0               
## 

Tuning the hyperparameters of the radial basis SVM

grid_radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(1, 10, 100, 500, 1000))
grid_radial
##    sigma    C
## 1   0.01    1
## 2   0.02    1
## 3   0.05    1
## 4   0.10    1
## 5   0.01   10
## 6   0.02   10
## 7   0.05   10
## 8   0.10   10
## 9   0.01  100
## 10  0.02  100
## 11  0.05  100
## 12  0.10  100
## 13  0.01  500
## 14  0.02  500
## 15  0.05  500
## 16  0.10  500
## 17  0.01 1000
## 18  0.02 1000
## 19  0.05 1000
## 20  0.10 1000
set.seed(143)

svm_Radial_Grid <- train(RESPONSE ~., data = German_Credit.tr.subs, 
                         method = "svmRadial",
                         trControl=German_Credit.trctrl,
                         tuneGrid = grid_radial)

svm_Radial_Grid
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 446 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 402, 400, 402, 401, 402, 402, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C     Accuracy   Kappa    
##   0.01      1  0.7309289  0.4618150
##   0.01     10  0.6971476  0.3942416
##   0.01    100  0.6907708  0.3814209
##   0.01    500  0.6975889  0.3950572
##   0.01   1000  0.6975889  0.3950572
##   0.02      1  0.7309816  0.4620420
##   0.02     10  0.6972925  0.3946754
##   0.02    100  0.7064273  0.4127482
##   0.02    500  0.7064273  0.4127482
##   0.02   1000  0.7064273  0.4127482
##   0.05      1  0.7038647  0.4085756
##   0.05     10  0.6860299  0.3726705
##   0.05    100  0.6860299  0.3726705
##   0.05    500  0.6860299  0.3726705
##   0.05   1000  0.6860299  0.3726705
##   0.10      1  0.6588603  0.3190546
##   0.10     10  0.6611792  0.3234506
##   0.10    100  0.6611792  0.3234506
##   0.10    500  0.6611792  0.3234506
##   0.10   1000  0.6611792  0.3234506
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.02 and C = 1.
plot(svm_Radial_Grid)

svm_Radial_Grid$bestTune
##   sigma C
## 6  0.02 1
German_credit.rb.tuned <- svm(RESPONSE ~ .,data = German_Credit.tr.subs,
                              kernel = "radial", 
                              gamma = svm_Radial_Grid$bestTune$sigma,
                              cost = svm_Radial_Grid$bestTune$C)

German_credit.rb.tuned.pred <- predict(German_credit.rb.tuned, 
                                       newdata = German_credit.te)

confusionMatrix(data=German_credit.rb.tuned.pred, 
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  54  53
##          1  23 120
##                                           
##                Accuracy : 0.696           
##                  95% CI : (0.6349, 0.7524)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.4761878       
##                                           
##                   Kappa : 0.3564          
##                                           
##  Mcnemar's Test P-Value : 0.0008794       
##                                           
##             Sensitivity : 0.7013          
##             Specificity : 0.6936          
##          Pos Pred Value : 0.5047          
##          Neg Pred Value : 0.8392          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2160          
##    Detection Prevalence : 0.4280          
##       Balanced Accuracy : 0.6975          
##                                           
##        'Positive' Class : 0               
## 

We have an accuracy of 0.696 and a balanced accuracy of 0.6975.

Neural network - Simple hyperparameter tuning

To select good parameters, we build a search grid and fit the model with each combination in the grid. This brute-force approach is time consuming, but the best model is then selected among all the candidates.

set.seed(1)
fitControl <- trainControl(method = "cv", 
                           number = 10)

nnetGrid <-  expand.grid(size = seq(from = 1, to = 6, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))

nnetFit <- train(RESPONSE ~ ., 
                 data = German_Credit.tr.subs,
                 method = "nnet",
                 metric = "Accuracy",
                 tuneGrid = nnetGrid,
                 trControl = fitControl)
plot(nnetFit)

The best neural network configuration is 3 units in the (single) hidden layer, with a weight decay of 0.4.

set.seed(345)
nn4 <- nnet(RESPONSE ~ ., data=German_Credit.tr.subs, size=3, decay = 0.4)
## # weights:  142
## initial  value 319.952116 
## iter  10 value 308.907473
## iter  20 value 304.936051
## iter  30 value 282.481326
## iter  40 value 232.152300
## iter  50 value 220.473107
## iter  60 value 219.575976
## iter  70 value 218.752380
## iter  80 value 218.532186
## iter  90 value 218.306805
## iter 100 value 217.488600
## final  value 217.488600 
## stopped after 100 iterations
nn4_pred <- predict(nn4, newdata = German_credit.te, type="class")

confusionMatrix(data= as.factor(nn4_pred), 
                reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  52  53
##          1  25 120
##                                           
##                Accuracy : 0.688           
##                  95% CI : (0.6266, 0.7449)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.584659        
##                                           
##                   Kappa : 0.3352          
##                                           
##  Mcnemar's Test P-Value : 0.002235        
##                                           
##             Sensitivity : 0.6753          
##             Specificity : 0.6936          
##          Pos Pred Value : 0.4952          
##          Neg Pred Value : 0.8276          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2080          
##    Detection Prevalence : 0.4200          
##       Balanced Accuracy : 0.6845          
##                                           
##        'Positive' Class : 0               
## 

The accuracy of the neural network is 68.8% (balanced accuracy: 68.45%).

Gradient Boosting

The gradient boosting model (XGBoost) accepts only numerical inputs, so we must transform our data before using it.

Gradient_boost.y_train <- as.integer(German_Credit.tr.subs$RESPONSE)-1
Gradient_boost.y_test <- as.integer(German_credit.te$RESPONSE)-1

Gradient_boost.X_train <- sparse.model.matrix(RESPONSE ~ .-1, 
                                              data = German_Credit.tr.subs )
Gradient_boost.X_test <- sparse.model.matrix(RESPONSE ~ .-1,
                                             data = German_credit.te )
set.seed(123)
xgb_train <- xgb.DMatrix(data = Gradient_boost.X_train,
                         label = Gradient_boost.y_train)

xgb_test <- xgb.DMatrix(data = Gradient_boost.X_test, 
                        label = Gradient_boost.y_test)

xgb_params <- list(
  booster = "gbtree",
  eta = 0.01,
  max_depth = 8,
  gamma = 4,
  subsample = 0.75,
  colsample_bytree = 1,
  objective = "multi:softmax",
  eval_metric = "mlogloss",
  num_class = 2
  )

xgb_model <- xgb.train(
  params = xgb_params,
  data = xgb_train,
  nrounds = 5000,
  verbose = 1
  )

xgb_model
## ##### xgb.Booster
## raw: 31.2 Mb 
## call:
##   xgb.train(params = xgb_params, data = xgb_train, nrounds = 5000, 
##     verbose = 1)
## params (as set within xgb.train):
##   booster = "gbtree", eta = "0.01", max_depth = "8", gamma = "4", subsample = "0.75", colsample_bytree = "1", objective = "multi:softmax", eval_metric = "mlogloss", num_class = "2", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 46 
## niter: 5000
## nfeatures : 46
xgb_preds <- predict(xgb_model, Gradient_boost.X_test, reshape = TRUE)

Here we get an accuracy of 68.4%. It is reasonable, but there is room for improvement.

confusionMatrix(data=as.factor(xgb_preds), 
                reference = as.factor(Gradient_boost.y_test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  57  59
##          1  20 114
##                                           
##                Accuracy : 0.684           
##                  95% CI : (0.6224, 0.7411)
##     No Information Rate : 0.692           
##     P-Value [Acc > NIR] : 0.6369          
##                                           
##                   Kappa : 0.35            
##                                           
##  Mcnemar's Test P-Value : 1.909e-05       
##                                           
##             Sensitivity : 0.7403          
##             Specificity : 0.6590          
##          Pos Pred Value : 0.4914          
##          Neg Pred Value : 0.8507          
##              Prevalence : 0.3080          
##          Detection Rate : 0.2280          
##    Detection Prevalence : 0.4640          
##       Balanced Accuracy : 0.6996          
##                                           
##        'Positive' Class : 0               
## 

Next Analysis

We now validate our models with cross-validation and the bootstrap; with that, we will have achieved a complete supervised learning task from A to Z.

Cross-validation with caret

A 10-fold cross-validation (10-CV) is easily obtained with caret.

First, we set up the data-splitting method using the trainControl function.

set.seed(123)
German_Credit.trctrl <- trainControl(method = "cv", number=10)
German.Credit.cv <- train(RESPONSE ~., data = German_credit.tr,
                method = "glmStepAIC", 
                family="binomial",
                trControl=German_Credit.trctrl, trace=0)
German.Credit.cv
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 750 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 674, 676, 675, 675, 675, 675, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7479834  0.3632119
German_Credit.cv.pred <- predict(German.Credit.cv, newdata = German_credit.te)

confusionMatrix(data=German_Credit.cv.pred, reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  35  23
##          1  42 150
##                                          
##                Accuracy : 0.74           
##                  95% CI : (0.681, 0.7932)
##     No Information Rate : 0.692          
##     P-Value [Acc > NIR] : 0.05592        
##                                          
##                   Kappa : 0.3452         
##                                          
##  Mcnemar's Test P-Value : 0.02557        
##                                          
##             Sensitivity : 0.4545         
##             Specificity : 0.8671         
##          Pos Pred Value : 0.6034         
##          Neg Pred Value : 0.7812         
##              Prevalence : 0.3080         
##          Detection Rate : 0.1400         
##    Detection Prevalence : 0.2320         
##       Balanced Accuracy : 0.6608         
##                                          
##        'Positive' Class : 0              
## 

Bootstrap with 10 replicates

We now apply the bootstrap with 10 replicates. As for CV, we use caret.

The approach is the same as before; we only need to change the method in the trainControl function. The corresponding method is “boot632”.

With 100 replicates the run time would be very long, so we use 10 replicates instead (about 3 minutes on our machine).

set.seed(123)
trctrl <- trainControl(method = "boot632", number=10)
German_credit.boot <- train(RESPONSE ~., data = German_credit.tr,
                            method = "glmStepAIC", family="binomial",
                            trControl=trctrl, trace = 0)

German_credit.boot
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 750 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (10 reps) 
## Summary of sample sizes: 750, 750, 750, 750, 750, 750, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7544188  0.3893564
German_Credit.boot.pred <- predict(German_credit.boot, newdata = German_credit.te)

confusionMatrix(data=German_Credit.boot.pred, reference = German_credit.te$RESPONSE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  35  23
##          1  42 150
##                                          
##                Accuracy : 0.74           
##                  95% CI : (0.681, 0.7932)
##     No Information Rate : 0.692          
##     P-Value [Acc > NIR] : 0.05592        
##                                          
##                   Kappa : 0.3452         
##                                          
##  Mcnemar's Test P-Value : 0.02557        
##                                          
##             Sensitivity : 0.4545         
##             Specificity : 0.8671         
##          Pos Pred Value : 0.6034         
##          Neg Pred Value : 0.7812         
##              Prevalence : 0.3080         
##          Detection Rate : 0.1400         
##    Detection Prevalence : 0.2320         
##       Balanced Accuracy : 0.6608         
##                                          
##        'Positive' Class : 0              
##